
Learning with Sparse Latent Structure

Vlad Niculae

Instituto de Telecomunicações

Work with: André Martins, Claire Cardie, Mathieu Blondel

github.com/deep-spin/lp-sparsemap @vnfrombucharest https://vene.ro

Rich Underlying Structure

title

author

date

body

segmentation: sentences, words, and so on

entities

relationships, e.g., dependency

Most of this structure is hidden.

2

Rich Underlying Structure

Widely occurring pattern!

speech (Andre‐Obrecht, 1988)

objects (Long et al., 2015)

transition graphs (Kipf, Pol, et al., 2020)

But we'll focus on NLP.

3

Structured Prediction

VERB PREP NOUN    dog on wheels
NOUN PREP NOUN    dog on wheels
NOUN DET NOUN     dog on wheels
· · ·

⋆ dog on wheels    (candidate dependency trees)
· · ·

dog on wheels ↔ hond op wielen    (candidate alignments)
· · ·

4

Structured Prediction

· · ·

5

Traditional Pipeline Approach

input → pretrained parser → output (positive / neutral / negative)

6

Deep Learning & Hidden Representations

input → dense vector → output (positive / neutral / negative)

7

Latent Structure Models

input → · · · (latent structures) → output (positive / neutral / negative)

8

*record scratch*

*freeze frame*

How to select an item from a set?

9

How to select an item from a set?

· · ·

10

How to select an item from a set?

c1, c2, · · ·, cN

input x  →  scores θ = f(x; w)  →  selection p  →  output y = g(p, x; w)

e.g., θ = [2, 4, −1, 1, −3]  gives  p = [0, 1, 0, 0, 0]

∂y/∂w = ?    or, essentially,    ∂p/∂θ = ?

11

Argmax

c1, c2, · · ·, cN

θ → p

∂p/∂θ = ?

12

Argmax

c1, c2, · · ·, cN

θ → p

∂p/∂θ = 0

[Figure: p1 as a function of θ1 (θ2 fixed): a step from 0 to 1 at θ1 = θ2.]

13

Argmax vs. Softmax

c1, c2, · · ·, cN

θ → p

softmax: pj = exp(θj)/Z,    ∂p/∂θ = diag(p) − p p⊤

[Figure: p1 as a function of θ1 (θ2 fixed): the hard argmax step vs. the smooth softmax curve.]

14
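As a quick check of the softmax Jacobian above, here is a small NumPy sketch (mine, not from the talk) comparing diag(p) − p p⊤ against finite differences:

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())   # shift for numerical stability
    return e / e.sum()

theta = np.array([2.0, 4.0, -1.0, 1.0, -3.0])
p = softmax(theta)
J_analytic = np.diag(p) - np.outer(p, p)

eps = 1e-6
J_numeric = np.stack(
    [(softmax(theta + eps * e) - softmax(theta - eps * e)) / (2 * eps)
     for e in np.eye(len(theta))], axis=1)   # column j = dp/dtheta_j

print(np.max(np.abs(J_analytic - J_numeric)))   # tiny (finite-difference error)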

A Softmax Origin Story

△ := {p ∈ Rᴺ : p ≥ 0, 1⊤p = 1}

maxⱼ θⱼ = max_{p ∈ △} p⊤θ    Fundamental Thm. Lin. Prog. (Dantzig et al., 1955)

[Figure: the simplex △ for N = 2 (a segment with vertices p = [1, 0], p = [0, 1] and midpoint p = [1/2, 1/2]) and N = 3 (a triangle with vertices such as p = [0, 1, 0], p = [0, 0, 1] and center p = [1/3, 1/3, 1/3]); e.g., θ = [.2, 1.4] gives p⋆ = [0, 1], and θ = [.7, .1, 1.5] gives p⋆ = [0, 0, 1].]

15
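A tiny numeric illustration of this identity (mine, not from the slides): because the LP over △ attains its maximum at a vertex, checking the one-hot vertices recovers maxⱼ θⱼ.

import numpy as np

theta = np.array([0.7, 0.1, 1.5])
vertices = np.eye(len(theta))       # the vertices of the simplex
values = vertices @ theta           # p^T theta at each vertex
p_star = vertices[np.argmax(values)]

print(values.max(), theta.max())    # both 1.5
print(p_star)                       # [0., 0., 1.]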

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax:    Ω(p) = 0 (no smoothing)

softmax:   Ω(p) = ∑ⱼ pⱼ log pⱼ

sparsemax: Ω(p) = ½∥p∥²₂    (Martins and Astudillo, 2016)

α‐entmax:  Ω(p) = 1/(α(α−1)) ∑ⱼ pⱼ^α
Tsallis (1988); a generalized entropy (Grünwald and Dawid, 2004)
(Blondel, Martins, and Niculae 2019a; Peters, Niculae, and Martins 2019; Correia, Niculae, and Martins 2019)

[Figure: p1 as a function of θ1 under each operator; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7].]

(Niculae and Blondel, 2017)

16
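To make the sparsity concrete, here is a small NumPy sketch (mine, following the sparsemax definition above; the O(d) algorithm is detailed on slide 41) comparing softmax and sparsemax on one score vector:

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def sparsemax(theta):
    # Euclidean projection onto the simplex, via sorting (for illustration).
    z = np.sort(theta)[::-1]                  # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = k * z > cssv                    # coordinates that stay positive
    tau = cssv[support][-1] / k[support][-1]  # threshold
    return np.maximum(theta - tau, 0.0)

theta = np.array([1.6, 1.3, 0.1])
print(softmax(theta))    # dense: every entry strictly positive
print(sparsemax(theta))  # sparse: [0.65, 0.35, 0.0]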

softmax

17

sparsemax

18

fusedmax ?!

19

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax:    Ω(p) = 0 (no smoothing)

softmax:   Ω(p) = ∑ⱼ pⱼ log pⱼ

sparsemax: Ω(p) = ½∥p∥²₂

α‐entmax:  Ω(p) = 1/(α(α−1)) ∑ⱼ pⱼ^α

fusedmax:  Ω(p) = ½∥p∥²₂ + ∑ⱼ |pⱼ − pⱼ₋₁|

[Figure: p1 as a function of θ1 under each operator; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7].]

(Niculae and Blondel, 2017)

20–21

Structured Prediction, finally

Structured prediction is essentially a (very high‐dimensional) argmax.

c1, c2, · · ·, cN

input x  →  θ  →  p  →  output y

There are exponentially many structures (θ cannot fit in memory!)

22

Factorization Into Parts

θ = A⊤η

⋆ dog on wheels

A (rows: arc parts; columns: structures):
    ⋆→dog        1  0  0  ...
    on→dog       0  1  1  ...
    wheels→dog   0  0  0  ...
    ⋆→on         0  1  1  ...
    dog→on       1  0  0  ...
    wheels→on    0  0  0  ...
    ⋆→wheels     0  0  0  ...
    dog→wheels   0  1  0  ...
    on→wheels    1  0  1  ...

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]

a_y = [0 1 0  1 0 0  0 0 1]    (the column of A for tree y; TREE factor)

For alignments (dog on wheels ↔ hond op wielen):

A (rows: alignment parts):
    dog—hond       1  0  0  ...
    dog—op         0  1  1  ...
    dog—wielen     0  0  0  ...
    on—hond        0  0  0  ...
    on—op          1  0  0  ...
    on—wielen      0  1  1  ...
    wheels—hond    0  1  0  ...
    wheels—op      0  0  0  ...
    wheels—wielen  1  0  1  ...

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]

23
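To make the factorization concrete, here is a minimal NumPy sketch (my own, not from the talk) that builds the arc‐indicator matrix A for the three example trees above and computes structure scores θ = A⊤η and arc marginals μ = Ap:

import numpy as np

arcs = ["*->dog", "on->dog", "wheels->dog",
        "*->on", "dog->on", "wheels->on",
        "*->wheels", "dog->wheels", "on->wheels"]

# Columns of A: three of the possible dependency trees over "dog on wheels".
A = np.array([[1, 0, 0],   # *->dog
              [0, 1, 1],   # on->dog
              [0, 0, 0],   # wheels->dog
              [0, 1, 1],   # *->on
              [1, 0, 0],   # dog->on
              [0, 0, 0],   # wheels->on
              [0, 0, 0],   # *->wheels
              [0, 1, 0],   # dog->wheels
              [1, 0, 1]])  # on->wheels

eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])  # arc (part) scores

theta = A.T @ eta               # per-structure scores: theta = A^T eta
p = np.array([0.7, 0.3, 0.0])   # a distribution over the three structures
mu = A @ p                      # arc marginals: mu = A p = E_{H~p}[a_H]

print(theta)
for arc, m in zip(arcs, mu):
    print(arc, round(float(m), 2))

In practice A is never materialized (there are exponentially many columns); the algorithms below work from η through MAP or marginal oracles instead.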

argmax      argmax_{p ∈ △} p⊤θ              ↔   MAP         argmax_{μ ∈ M} μ⊤η
softmax     argmax_{p ∈ △} p⊤θ + H(p)       ↔   marginals   argmax_{μ ∈ M} μ⊤η + H̃(μ)
sparsemax   argmax_{p ∈ △} p⊤θ − ½∥p∥²      ↔   SparseMAP   argmax_{μ ∈ M} μ⊤η − ½∥μ∥²   (Niculae, Martins, Blondel, and Cardie, 2018)

MAP: e.g., dependency parsing → Chu‐Liu/Edmonds; matching → Kuhn‐Munkres.

Marginals: e.g., sequence labeling → forward‐backward (Rabiner, 1989); as attention: (Kim et al., 2017); dependency parsing → the Matrix‐Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007); as attention: (Liu and Lapata, 2018); but e.g. matchings → #P‐complete! (Taskar, 2004; Valiant, 1979).

M := conv{a_h : h ∈ H} = {Ap : p ∈ △} = {E_{H∼p}[a_H] : p ∈ △}

24

Algorithms for SparseMAP

μ⋆ = argmax_{μ ∈ M} μ⊤η − ½∥μ∥²    (quadratic objective, linear constraints; alas, exponentially many!)

Forward pass: Conditional Gradient (Frank and Wolfe, 1956; Lacoste‐Julien and Jaggi, 2015)
• select a new corner of M:  a_{y⋆} = argmax_{μ ∈ M} μ⊤η̃,  where η̃ := η − μ⁽ᵗ⁻¹⁾ (a MAP call)
• update the (sparse) coefficients of p
• Update rules: vanilla, away‐step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5) (Wolfe, 1976; Martins, Figueiredo, et al., 2015)
  Active Set achieves finite & linear convergence!

Backward pass
• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) · nnz(p⋆))

Completely modular: just add MAP.

25
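The following is a minimal sketch (my own, under simplifying assumptions) of the vanilla conditional‐gradient update for SparseMAP: each iteration calls MAP once on the modified scores η − μ⁽ᵗ⁻¹⁾ and takes an exact line‐search step. The talk's solver is the active‐set variant, which additionally re‐solves over the selected corners; here corners may repeat and are not merged.

import numpy as np

def sparsemap_cg(eta, map_oracle, n_iter=100, tol=1e-9):
    """Vanilla Frank-Wolfe for mu* = argmax_{mu in M} mu^T eta - 1/2 ||mu||^2.
    map_oracle(scores) must return the 0/1 indicator vector of the best structure."""
    mu = map_oracle(eta).astype(float)          # start at a corner of M
    corners, weights = [mu.copy()], [1.0]       # sparse distribution p over corners
    for _ in range(n_iter):
        grad = eta - mu                         # gradient of the objective at mu
        s = map_oracle(grad).astype(float)      # linear maximization oracle = MAP call
        d = s - mu
        gap = grad @ d                          # Frank-Wolfe duality gap
        if gap <= tol:
            break
        gamma = min(1.0, gap / (d @ d))         # exact line search for the quadratic
        mu = (1 - gamma) * mu + gamma * s
        weights = [(1 - gamma) * w for w in weights] + [gamma]
        corners.append(s)
    return mu, corners, weights

# Toy usage with an explicitly enumerated structure set (columns of A):
A = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]], dtype=float)
eta = np.array([0.5, 1.0, -0.2])
map_oracle = lambda scores: A[:, np.argmax(A.T @ scores)]
mu_star, corners, weights = sparsemap_cg(eta, map_oracle)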

SparseMAP Applications

• Sparse alignment attention (more later) (Niculae, Martins, Blondel, and Cardie, 2018)

• Latent TreeLSTM (Niculae, Martins, and Cardie, 2018)

• As loss: supervised dependency parsing (Niculae, Martins, Blondel, and Cardie 2018; Blondel, Martins, and Niculae 2019b)

26

Latent Dependency Trees

ListOps (Nangia and Bowman, 2018)

Arity tagging with latent GCN (Corro and Titov, 2019; Kipf and Welling, 2017)

(max 2 9 (min 4 7 ) 0 )  →  arity tags: 4 - - 2 - - - - -

27

[Figure: ListOps validation F1 vs. epoch (0–100); curves for Gold tree, Latent tree, and Left‐to‐right.]

28

What if MAP is not available?

Multiple, Overlapping Factors

Maximization in factor graphs: NP‐hard, even when each factor is tractable.

⋆ dog on wheels

[Factor graph: the arc variables (⋆→·, dog→·, on→·, wheels→·) connected to one TREE factor and four BUDGET factors.]

29

Optimization as Consensus-Seeking

[Factor graph: factors fa and fb over overlapping variable blocks μ[1:3], μ[4:6], μ[7:9]; μa and μb share the middle block.]

Agreement on the overlap: μa,[4:6] = μb,[4:6] = μ[4:6]

max_{μ, μf}  ∑_{f∈F} ηf⊤μf   s.t.  Cf μ = μf,  μf ∈ Mf  for f ∈ F

LP relaxation (Wainwright and Jordan, 2008) over the local polytope:  L := {μ : Cf μ ∈ Mf, f ∈ F} ⊇ M

LP‐SparseMAP adds the quadratic term:

max_{μ, μf}  ∑_{f∈F} ηf⊤μf − ½∥μ∥²   s.t.  Cf μ = μf,  μf ∈ Mf  for f ∈ F

(Niculae and Martins, 2020)

30

Algorithms for LP-SparseMAP (Niculae and Martins, 2020)

Forward pass

argmax_{Cf μ = μf}  ∑_{f∈F} ηf⊤μf − ½∥μ∥²  =  argmax_{Cf μ = μf}  ∑_{f∈F} ( ηf⊤μf − ½∥Df μf∥² )

• Separable objective, agreement constraints → ADMM in consensus form

• SparseMAP subproblem for each f

Backward pass

• Sparse fixed‐point iteration

• Combines the SparseMAP Jacobians of each factor

31

Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)

[Factor graph: a TREE factor plus BUDGET factors over the arc variables.]

Factor graphs as a hidden‐layer DSL!

If |F| = 1, recovers SparseMAP.

Modular library. Built‐in specialized factors:
• OR, XOR, AND
• OR‐with‐output
• Budget, Knapsack
• Pairwise

New factors only require MAP.

fg = FactorGraph()
var = [[fg.variable() for j in range(n)] for i in range(n)]  # handwave: one variable per candidate arc i -> j

fg.add(Tree(var))

for i in range(n):
    fg.add(Budget(var[i], budget=5))

μ = fg.lp_sparsemap(η)

class Factor:
    def map(self, ηf):  # abstract, private
        raise NotImplementedError

    def sparsemap(self, ηf):
        ...  # active set algo, uses self.map

    def backward(self, dμf):
        ...  # analytic, uses active set result

class Budget(Factor):
    def sparsemap(self, ηf):
        ...  # specialized

    def backward(self, dμf):
        ...  # specialized

class Tree(Factor):
    def map(self, η):
        ...  # Chu-Liu/Edmonds algo

32

[Figure: ListOps validation F1 vs. epoch (0–100); curves for Gold tree, Latent tree, Latent w/ Budget(5), and Left‐to‐right.]

33

Structured Attention for Alignments

NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.

input (P, H)  →  attention between premise words (A, gentleman, overlooking, ..., situation) and hypothesis words (A, police, officer, ..., closely)  →  output: entails / contradicts / neutral

(Model: decomposable attention (Parikh et al., 2016))

(Proposed model: global structured alignment.)

34

Structured Alignment Models

matching: SparseMAP w/ Kuhn‐Munkres (Kuhn, 1955)

LP‐matching: LP‐SparseMAP w/ XORs (equivalent; different solver)

LP‐sequence: additional score for contiguous alignments (i, j) − (i + 1, j ± 1)

35
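As a concrete example of "just add MAP" (my sketch, not the paper's code): the MAP oracle for the matching factor is a linear assignment problem, solvable with the Kuhn‐Munkres algorithm, e.g. via SciPy.

import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_map(scores):
    """scores[i, j]: score of aligning premise word i with hypothesis word j.
    Returns the 0/1 indicator of the highest-scoring one-to-one matching."""
    rows, cols = linear_sum_assignment(-scores)   # minimizing negated scores = maximizing scores
    a = np.zeros_like(scores)
    a[rows, cols] = 1.0
    return a

print(matching_map(np.random.randn(4, 4)))

Plugged into a SparseMAP solver (such as the conditional‐gradient sketch after slide 25, with the matrix flattened to a vector), this yields sparse matching attention; the LP variants instead decompose the constraint into XOR factors.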

MultiNLI (Williams et al., 2017)

[Figure: accuracy bars for softmax, matching, LP‐matching, and LP‐sequence; axis 66%–72%.]

36

[Figures: learned alignments between the premise "a gentleman overlooking a neighborhood situation ." and the hypothesis "a police officer watches a situation closely ."]

37–38

Conclusions

Differentiable & sparse structured inference

Generic, extensible, efficient algorithms

Interpretable structured attention

Future work

Structure beyond NLP

Weak & semi‐supervision

Generative latent structure models

vlad@vene.ro · github.com/deep-spin/lp-sparsemap · https://vene.ro · @vnfrombucharest

Extra slides

Acknowledgements

This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.

Some icons by Dave Gandy and Freepik via flaticon.com.

40

Sparsemax

sparsemax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥²₂ = argmin_{p ∈ △} ∥p − θ∥²₂

Computation:

p⋆ = [θ − τ1]₊ ;  θi > θj ⇒ pi ≥ pj ;  O(d) via partial sort (Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass:

J_sparsemax = diag(s) − (1/|S|) s s⊤,  where S = {j : p⋆j > 0},  sj = ⟦j ∈ S⟧    (Martins and Astudillo, 2016)

argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)

41
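A minimal sketch (my own) of the backward pass above: applying J_sparsemax = diag(s) − (1/|S|) s s⊤ to an incoming gradient, given an already‐computed p⋆.

import numpy as np

def sparsemax_jvp(p_star, v):
    s = (p_star > 0).astype(float)        # s_j = 1 if j is in the support S
    return s * v - (s @ v / s.sum()) * s  # diag(s) v - (1/|S|) s s^T v

p_star = np.array([0.65, 0.35, 0.0])      # e.g., sparsemax([1.6, 1.3, 0.1])
v = np.array([1.0, -2.0, 3.0])            # incoming gradient dL/dp
print(sparsemax_jvp(p_star, v))           # gradient dL/dtheta; zero outside the support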

Fusedmax

fusedmax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥²₂ − ∑_{2≤j≤d} |pj − pj−1|
            = argmin_{p ∈ △} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|

prox_fused(θ) = argmin_{p ∈ Rᵈ} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|

Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))    (Niculae and Blondel, 2017)

“Fused Lasso”, a.k.a. 1‐d Total Variation (Tibshirani et al., 2005)

42

Danskin’s Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)

Let φ : Rᵈ × Z → R, Z ⊂ Rᵈ compact. Then

∂ max_{z∈Z} φ(x, z) = conv{∇x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z)}.

Example: maximum of a vector

∂ max_{j∈[d]} θj = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{∇θ φ(p⋆, θ)} = conv{p⋆}

[Figure: for θ = [t, 0], the plot of maxⱼ θⱼ as a function of t, and of the set {g1 | g ∈ ∂ maxⱼ θⱼ}, over t ∈ [−1, +1].]

43
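A small numeric check (mine) of the example above: away from ties, the max is differentiable and its gradient is exactly the one‐hot p⋆.

import numpy as np

def max_val(theta):
    return theta.max()

theta = np.array([0.3, 0.0])                 # t = 0.3 > 0, unique argmax
p_star = np.eye(2)[np.argmax(theta)]         # [1, 0]

eps = 1e-6
grad = np.array([(max_val(theta + eps * e) - max_val(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(grad, p_star)                          # both [1, 0]; at t = 0 only a subgradient set exists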

Dynamically inferring the computation graph

So far: a structured hidden layer E_H[a_H]

Network must handle “soft” combinations of structures. Fine for attention, but can be limiting.

Dependency TreeLSTM

The bears eat the pretty ones

(Tai et al., 2015)

45

Latent Dependency TreeLSTM

input x  →  h ∈ H  →  output y

p(y|x) = ∑_{h∈H} p(y | h, x) p(h | x)

The bears eat the pretty ones

(Niculae, Martins, and Cardie, 2018)

46

Structured Latent Variable Models

p(y | x) = ∑_{h ∈ H} pφ(y | h, x) pπ(h | x)

    pφ(y | h, x): e.g., a TreeLSTM defined by h
    pπ(h | x): a parsing model, using some scoreπ(h; x)
    the sum is over all possible trees

How to define pπ? And how to compute ∂p(y | x)/∂π, a sum over all h ∈ H (exponentially large!)?

idea 1    pπ(h | x) = 1 if h = h⋆ else 0          argmax
idea 2    pπ(h | x) ∝ exp(scoreπ(h; x))           softmax
idea 3    SparseMAP

47
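A minimal sketch (my illustration, with hypothetical helper names): with SparseMAP, pπ has small support, so the exponentially large sum collapses to a few terms, and only those structures' computation graphs (e.g., TreeLSTMs) need to be built and run.

def marginal_likelihood(x, y, sparse_posterior, p_phi):
    """sparse_posterior(x) -> list of (structure h, weight p_pi(h|x)) with nonzero weight;
    p_phi(y, h, x) -> likelihood of y under the network built from structure h."""
    return sum(w * p_phi(y, h, x) for h, w in sparse_posterior(x) if w > 0.0)

# Toy usage with two surviving structures:
posterior = lambda x: [("tree_a", 0.7), ("tree_b", 0.3)]
p_phi = lambda y, h, x: {"tree_a": 0.9, "tree_b": 0.4}[h]
print(marginal_likelihood("x", "y", posterior, p_phi))   # 0.7*0.9 + 0.3*0.4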

SparseMAP

[Figure: the expected structure for a sentence decomposes as .7 × (tree 1) + .3 × (tree 2) + 0 × (another tree) + ...]

p(y | x) = .7 pφ(y | tree 1) + .3 pφ(y | tree 2)

48

Sentiment classification (SST)
[Figure: accuracy (binary), 80%–85% axis; bars for LTR, Flat, CoreNLP, Latent.]

Natural Language Inference (SNLI)
[Figure: accuracy (3‐class), 80.6%–82% axis; bars for LTR, Flat, CoreNLP, Latent.]
Sentence pair classification (P, H): p(y | P, H) = ∑_{hP ∈ H(P)} ∑_{hH ∈ H(H)} pφ(y | hP, hH) pπ(hP | P) pπ(hH | H)

Reverse dictionary lookup (definitions and concepts)
[Figure: accuracy@10, 30%–38% axis; bars for LTR, Flat, Latent.]
Given a word description, predict the word embedding (Hill et al., 2016); instead of p(y | x), we model E_{pπ}[g(x)] = ∑_{h ∈ H} g(x; h) pπ(h | x)

Baselines (⋆ The bears eat the pretty ones):
Left‐to‐right: regular LSTM
Flat: bag‐of‐words–like
CoreNLP: off‐line parser

Syntax vs. Composition Order

⋆ lovely and poignant .    p = 22.6%
⋆ lovely and poignant .    CoreNLP parse, p = 21.4%
· · ·

⋆ a deep and meaningful film .    p = 15.33%
⋆ a deep and meaningful film .    p = 15.27%
· · ·
⋆ a deep and meaningful film .    CoreNLP parse, p = 0%

[Figures: the trees receiving these posterior probabilities, with edge weights 1.0.]

50–51

Structured Output Prediction

SparseMAP loss:       L_A(η, μ̄) = max_{μ∈M} ( η⊤μ − ½∥μ∥² ) − η⊤μ̄ + ½∥μ̄∥²

cost‐SparseMAP loss:  Lρ_A(η, μ̄) = max_{μ∈M} ( η⊤μ − ½∥μ∥² + ρ(μ, μ̄) ) − η⊤μ̄ + ½∥μ̄∥²

Instance of a structured Fenchel‐Young loss, like CRF, SVM, etc. (Blondel, Martins, and Niculae, 2019b)

52
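Since the first term is a maximum of functions affine in η over the compact set M, Danskin's theorem (slide 43) gives the gradient of the SparseMAP loss directly; a short sketch in the notation above, with μ̄ the gold structure's marginals:

∇_η L_A(η, μ̄) = ∇_η max_{μ∈M} ( η⊤μ − ½∥μ∥² ) − μ̄ = μ⋆(η) − μ̄,

where μ⋆(η) is the SparseMAP solution (unique, since the objective is strongly concave in μ). A gradient step therefore pulls the predicted sparse marginals toward the gold ones; the cost‐augmented loss behaves the same way, except that μ⋆ solves the max with the extra ρ(μ, μ̄) term.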

References I

Amos, Brandon and J. Zico Kolter (2017). “OptNet: Differentiable optimization as a layer in neural networks”. In: Proc. of ICML.

Andre‐Obrecht, Regine (1988). “A new statistical approach for the automatic segmentation of continuous speech signals”. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 36.1, pp. 29–40.

Bertsekas, Dimitri P (1999). Nonlinear Programming. Athena Scientific Belmont.

Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019a). “Learning classifiers with Fenchel‐Young losses: Generalized entropies, margins, and algorithms”. In: Proc. of AISTATS.

— (2019b). “Learning with Fenchel‐Young Losses”. In: preprint arXiv:1901.02324.

Brucker, Peter (1984). “An O(n) algorithm for quadratic knapsack problems”. In: Operations Research Letters 3.3, pp. 163–166.

Condat, Laurent (2016). “Fast projection onto the simplex and the ℓ1 ball”. In: Mathematical Programming 158.1‐2, pp. 575–585.

Correia, Gonçalo M., Vlad Niculae, and André FT Martins (2019). “Adaptively Sparse Transformers”. In: Proc. EMNLP.

Corro, Caio and Ivan Titov (2019). “Learning latent trees with stochastic perturbations and differentiable dynamic programming”. In: Proc. of ACL.

Danskin, John M (1966). “The theory of max‐min, with applications”. In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.

53

References II

Dantzig, George B, Alex Orden, and Philip Wolfe (1955). “The generalized simplex method for minimizing a linear form under linear inequality restraints”. In: Pacific Journal of Mathematics 5.2, pp. 183–195.

Frank, Marguerite and Philip Wolfe (1956). “An algorithm for quadratic programming”. In: Nav. Res. Log. 3.1‐2, pp. 95–110.

Gould, Stephen et al. (2016). “On differentiating parameterized argmin and argmax problems with application to bi‐level optimization”. In: preprint arXiv:1607.05447.

Grünwald, Peter D and A Philip Dawid (2004). “Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory”. In: Annals of Statistics, pp. 1367–1433.

Held, Michael, Philip Wolfe, and Harlan P Crowder (1974). “Validation of subgradient optimization”. In: Mathematical Programming 6.1, pp. 62–88.

Hill, Felix et al. (2016). “Learning to understand phrases by embedding the dictionary”. In: TACL 4.1, pp. 17–30.

Kim, Yoon et al. (2017). “Structured attention networks”. In: Proc. of ICLR.

Kipf, Thomas, Elise van der Pol, and Max Welling (2020). “Contrastive Learning of Structured World Models”. In: Proc. of ICLR.

Kipf, Thomas and Max Welling (2017). “Semi‐supervised classification with graph convolutional networks”. In: Proc. of ICLR.

54

References III

Koo, Terry et al. (2007). “Structured prediction models via the matrix‐tree theorem”. In: Proc. of EMNLP.

Kuhn, Harold W (1955). “The Hungarian method for the assignment problem”. In: Nav. Res. Log. 2.1‐2, pp. 83–97.

Lacoste‐Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank‐Wolfe optimization variants”. In: Proc. of NeurIPS.

Liu, Yang and Mirella Lapata (2018). “Learning structured text representations”. In: TACL 6, pp. 63–75.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). “Fully convolutional networks for semantic segmentation”. In: Proc. of CVPR.

Martins, André FT and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi‐label classification”. In: Proc. of ICML.

Martins, André FT, Mário AT Figueiredo, et al. (2015). “AD3: Alternating directions dual decomposition for MAP inference in graphical models”. In: JMLR 16.1, pp. 495–545.

McDonald, Ryan T and Giorgio Satta (2007). “On the complexity of non‐projective data‐driven dependency parsing”. In: Proc. of ICPT.

Nangia, Nikita and Samuel Bowman (2018). “ListOps: A diagnostic dataset for latent tree learning”. In: Proc. of NAACL SRW.

55

References IV

Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.

Niculae, Vlad and André FT Martins (2020). “LP‐SparseMAP: Differentiable relaxed optimization for sparse structured prediction”. In: preprint arXiv:2001.04437.

Niculae, Vlad, André FT Martins, Mathieu Blondel, et al. (2018). “SparseMAP: Differentiable sparse structured inference”. In: Proc. of ICML.

Niculae, Vlad, André FT Martins, and Claire Cardie (2018). “Towards dynamic computation graphs via sparse latent structure”. In: Proc. of EMNLP.

Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.

Parikh, Ankur et al. (2016). “A decomposable attention model for natural language inference”. In: Proc. of EMNLP.

Peters, Ben, Vlad Niculae, and André FT Martins (2019). “Sparse sequence‐to‐sequence models”. In: Proc. ACL.

Rabiner, Lawrence R. (1989). “A tutorial on Hidden Markov Models and selected applications in speech recognition”. In: P. IEEE 77.2, pp. 257–286.

Smith, David A and Noah A Smith (2007). “Probabilistic models of nonprojective dependency trees”. In: Proc. of EMNLP.

56

References V

Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). “Improved semantic representations from tree‐structured Long Short‐Term Memory networks”. In: Proc. of ACL‐IJCNLP.

Taskar, Ben (2004). “Learning structured prediction models: A large margin approach”. PhD thesis. Stanford University.

Tibshirani, Robert et al. (2005). “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.

Tsallis, Constantino (1988). “Possible generalization of Boltzmann‐Gibbs statistics”. In: Journal of Statistical Physics 52, pp. 479–487.

Valiant, Leslie G (1979). “The complexity of computing the permanent”. In: Theor. Comput. Sci. 8.2, pp. 189–201.

Wainwright, Martin J and Michael I Jordan (2008). Graphical models, exponential families, and variational inference. Vol. 1. 1–2. Now Publishers, Inc., pp. 1–305.

Williams, Adina, Nikita Nangia, and Samuel R Bowman (2017). “A broad‐coverage challenge corpus for sentence understanding through inference”. In: preprint arXiv:1704.05426.

Wolfe, Philip (1976). “Finding the nearest point in a polytope”. In:Mathematical Programming 11.1, pp. 128–149.

57
