
Learning with Sparse Latent Structure

Vlad Niculae

Instituto de Telecomunicações

Work with: André Martins, Claire Cardie, Mathieu Blondel

github.com/deep-spin/lp-sparsemap @vnfrombucharest https://vene.ro

Rich Underlying Structure

title

author

date

body

segmentation: sentences, words, and so on

entities

relationships, e.g., dependency

Most of this structure is hidden.

2

Rich Underlying Structure

Widely occurring pattern!

speech (Andre‐Obrecht, 1988)

objects (Long et al., 2015)

transition graphs (Kipf, Pol, et al., 2020)

But we'll focus on NLP.

3

Structured Prediction

VERB PREP NOUN    dog on wheels
NOUN PREP NOUN    dog on wheels
NOUN DET NOUN     dog on wheels
· · ·

⋆ dog on wheels    (candidate dependency trees)
· · ·

dog on wheels ↔ hond op wielen    (candidate alignments)
· · ·

4

Structured Prediction

· · ·

5

Traditional Pipeline Approach

input → pretrained parser → output (positive / neutral / negative)

6

Deep Learning & Hidden Representations

input → dense vector → output (positive / neutral / negative)

7

Latent Structure Models

input → · · · (latent structures) → output (positive / neutral / negative)

8

*record scratch*

*freeze frame*

How to select an item from a set?

9

How to select an item from a set?

· · ·

10

How to select an item from a set?

c1, c2, · · ·, cN

input x  →  scores θ = f(x; w)  →  selection p  →  output y = g(p, x; w)

e.g., θ = [2, 4, −1, 1, −3]  gives  p = [0, 1, 0, 0, 0]

∂y/∂w = ?    or, essentially,    ∂p/∂θ = ?

11

Argmax

c1, c2, · · ·, cN

θ → p

∂p/∂θ = ?

12

Argmax

c1, c2, · · ·, cN

θ → p

∂p/∂θ = 0

[Figure: p1 as a function of θ1 (θ2 fixed): a step from 0 to 1 at θ1 = θ2.]

13

Argmax vs. Softmax

c1, c2, · · ·, cN

θ → p

softmax: pj = exp(θj)/Z,    ∂p/∂θ = diag(p) − p p⊤

[Figure: p1 as a function of θ1 (θ2 fixed): the hard argmax step vs. the smooth softmax curve.]

14
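As a quick check of the softmax Jacobian above, here is a small NumPy sketch (mine, not from the talk) comparing diag(p) − p p⊤ against finite differences:

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())   # shift for numerical stability
    return e / e.sum()

theta = np.array([2.0, 4.0, -1.0, 1.0, -3.0])
p = softmax(theta)
J_analytic = np.diag(p) - np.outer(p, p)

eps = 1e-6
J_numeric = np.stack(
    [(softmax(theta + eps * e) - softmax(theta - eps * e)) / (2 * eps)
     for e in np.eye(len(theta))], axis=1)   # column j = dp/dtheta_j

print(np.max(np.abs(J_analytic - J_numeric)))   # tiny (finite-difference error)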

A Softmax Origin Story

△ := {p ∈ Rᴺ : p ≥ 0, 1⊤p = 1}

maxⱼ θⱼ = max_{p ∈ △} p⊤θ    Fundamental Thm. Lin. Prog. (Dantzig et al., 1955)

[Figure: the simplex △ for N = 2 (a segment with vertices p = [1, 0], p = [0, 1] and midpoint p = [1/2, 1/2]) and N = 3 (a triangle with vertices such as p = [0, 1, 0], p = [0, 0, 1] and center p = [1/3, 1/3, 1/3]); e.g., θ = [.2, 1.4] gives p⋆ = [0, 1], and θ = [.7, .1, 1.5] gives p⋆ = [0, 0, 1].]

15
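A tiny numeric illustration of this identity (mine, not from the slides): because the LP over △ attains its maximum at a vertex, checking the one-hot vertices recovers maxⱼ θⱼ.

import numpy as np

theta = np.array([0.7, 0.1, 1.5])
vertices = np.eye(len(theta))       # the vertices of the simplex
values = vertices @ theta           # p^T theta at each vertex
p_star = vertices[np.argmax(values)]

print(values.max(), theta.max())    # both 1.5
print(p_star)                       # [0., 0., 1.]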

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax:    Ω(p) = 0 (no smoothing)

softmax:   Ω(p) = ∑ⱼ pⱼ log pⱼ

sparsemax: Ω(p) = ½∥p∥²₂    (Martins and Astudillo, 2016)

α‐entmax:  Ω(p) = 1/(α(α−1)) ∑ⱼ pⱼ^α
Tsallis (1988); a generalized entropy (Grünwald and Dawid, 2004)
(Blondel, Martins, and Niculae 2019a; Peters, Niculae, and Martins 2019; Correia, Niculae, and Martins 2019)

[Figure: p1 as a function of θ1 under each operator; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7].]

(Niculae and Blondel, 2017)

16
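To make the sparsity concrete, here is a small NumPy sketch (mine, following the sparsemax definition above; the O(d) algorithm is detailed on slide 41) comparing softmax and sparsemax on one score vector:

import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def sparsemax(theta):
    # Euclidean projection onto the simplex, via sorting (for illustration).
    z = np.sort(theta)[::-1]                  # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = k * z > cssv                    # coordinates that stay positive
    tau = cssv[support][-1] / k[support][-1]  # threshold
    return np.maximum(theta - tau, 0.0)

theta = np.array([1.6, 1.3, 0.1])
print(softmax(theta))    # dense: every entry strictly positive
print(sparsemax(theta))  # sparse: [0.65, 0.35, 0.0]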

softmax

17

sparsemax

18

fusedmax ?!

19

Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax:    Ω(p) = 0 (no smoothing)

softmax:   Ω(p) = ∑ⱼ pⱼ log pⱼ

sparsemax: Ω(p) = ½∥p∥²₂

α‐entmax:  Ω(p) = 1/(α(α−1)) ∑ⱼ pⱼ^α

fusedmax:  Ω(p) = ½∥p∥²₂ + ∑ⱼ |pⱼ − pⱼ₋₁|

[Figure: p1 as a function of θ1 under each operator; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7].]

(Niculae and Blondel, 2017)

20–21

Structured Prediction, finally

Structured prediction is essentially a (very high‐dimensional) argmax.

c1, c2, · · ·, cN

input x  →  θ  →  p  →  output y

There are exponentially many structures (θ cannot fit in memory!)

22

Factorization Into Parts

θ = A⊤η

⋆ dog on wheels

A (rows: arc parts; columns: structures):
    ⋆→dog        1  0  0  ...
    on→dog       0  1  1  ...
    wheels→dog   0  0  0  ...
    ⋆→on         0  1  1  ...
    dog→on       1  0  0  ...
    wheels→on    0  0  0  ...
    ⋆→wheels     0  0  0  ...
    dog→wheels   0  1  0  ...
    on→wheels    1  0  1  ...

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]

a_y = [0 1 0  1 0 0  0 0 1]    (the column of A for tree y; TREE factor)

For alignments (dog on wheels ↔ hond op wielen):

A (rows: alignment parts):
    dog—hond       1  0  0  ...
    dog—op         0  1  1  ...
    dog—wielen     0  0  0  ...
    on—hond        0  0  0  ...
    on—op          1  0  0  ...
    on—wielen      0  1  1  ...
    wheels—hond    0  1  0  ...
    wheels—op      0  0  0  ...
    wheels—wielen  1  0  1  ...

η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1]

23
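To make the factorization concrete, here is a minimal NumPy sketch (my own, not from the talk) that builds the arc‐indicator matrix A for the three example trees above and computes structure scores θ = A⊤η and arc marginals μ = Ap:

import numpy as np

arcs = ["*->dog", "on->dog", "wheels->dog",
        "*->on", "dog->on", "wheels->on",
        "*->wheels", "dog->wheels", "on->wheels"]

# Columns of A: three of the possible dependency trees over "dog on wheels".
A = np.array([[1, 0, 0],   # *->dog
              [0, 1, 1],   # on->dog
              [0, 0, 0],   # wheels->dog
              [0, 1, 1],   # *->on
              [1, 0, 0],   # dog->on
              [0, 0, 0],   # wheels->on
              [0, 0, 0],   # *->wheels
              [0, 1, 0],   # dog->wheels
              [1, 0, 1]])  # on->wheels

eta = np.array([.1, .2, -.1, .3, .8, .1, -.3, .2, -.1])  # arc (part) scores

theta = A.T @ eta               # per-structure scores: theta = A^T eta
p = np.array([0.7, 0.3, 0.0])   # a distribution over the three structures
mu = A @ p                      # arc marginals: mu = A p = E_{H~p}[a_H]

print(theta)
for arc, m in zip(arcs, mu):
    print(arc, round(float(m), 2))

In practice A is never materialized (there are exponentially many columns); the algorithms below work from η through MAP or marginal oracles instead.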

argmax      argmax_{p ∈ △} p⊤θ              ↔   MAP         argmax_{μ ∈ M} μ⊤η
softmax     argmax_{p ∈ △} p⊤θ + H(p)       ↔   marginals   argmax_{μ ∈ M} μ⊤η + H̃(μ)
sparsemax   argmax_{p ∈ △} p⊤θ − ½∥p∥²      ↔   SparseMAP   argmax_{μ ∈ M} μ⊤η − ½∥μ∥²   (Niculae, Martins, Blondel, and Cardie, 2018)

MAP: e.g., dependency parsing → Chu‐Liu/Edmonds; matching → Kuhn‐Munkres.

Marginals: e.g., sequence labeling → forward‐backward (Rabiner, 1989); as attention: (Kim et al., 2017); dependency parsing → the Matrix‐Tree theorem (Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Satta, 2007); as attention: (Liu and Lapata, 2018); but e.g. matchings → #P‐complete! (Taskar, 2004; Valiant, 1979).

M := conv{a_h : h ∈ H} = {Ap : p ∈ △} = {E_{H∼p}[a_H] : p ∈ △}

24

Algorithms for SparseMAP

μ⋆ = argmax_{μ ∈ M} μ⊤η − ½∥μ∥²    (quadratic objective, linear constraints; alas, exponentially many!)

Forward pass: Conditional Gradient (Frank and Wolfe, 1956; Lacoste‐Julien and Jaggi, 2015)
• select a new corner of M:  a_{y⋆} = argmax_{μ ∈ M} μ⊤η̃,  where η̃ := η − μ⁽ᵗ⁻¹⁾ (a MAP call)
• update the (sparse) coefficients of p
• Update rules: vanilla, away‐step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5) (Wolfe, 1976; Martins, Figueiredo, et al., 2015)
  Active Set achieves finite & linear convergence!

Backward pass
• ∂μ/∂η is sparse
• computing (∂μ/∂η)⊤ dy takes O(dim(μ) · nnz(p⋆))

Completely modular: just add MAP.

25
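The following is a minimal sketch (my own, under simplifying assumptions) of the vanilla conditional‐gradient update for SparseMAP: each iteration calls MAP once on the modified scores η − μ⁽ᵗ⁻¹⁾ and takes an exact line‐search step. The talk's solver is the active‐set variant, which additionally re‐solves over the selected corners; here corners may repeat and are not merged.

import numpy as np

def sparsemap_cg(eta, map_oracle, n_iter=100, tol=1e-9):
    """Vanilla Frank-Wolfe for mu* = argmax_{mu in M} mu^T eta - 1/2 ||mu||^2.
    map_oracle(scores) must return the 0/1 indicator vector of the best structure."""
    mu = map_oracle(eta).astype(float)          # start at a corner of M
    corners, weights = [mu.copy()], [1.0]       # sparse distribution p over corners
    for _ in range(n_iter):
        grad = eta - mu                         # gradient of the objective at mu
        s = map_oracle(grad).astype(float)      # linear maximization oracle = MAP call
        d = s - mu
        gap = grad @ d                          # Frank-Wolfe duality gap
        if gap <= tol:
            break
        gamma = min(1.0, gap / (d @ d))         # exact line search for the quadratic
        mu = (1 - gamma) * mu + gamma * s
        weights = [(1 - gamma) * w for w in weights] + [gamma]
        corners.append(s)
    return mu, corners, weights

# Toy usage with an explicitly enumerated structure set (columns of A):
A = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0]], dtype=float)
eta = np.array([0.5, 1.0, -0.2])
map_oracle = lambda scores: A[:, np.argmax(A.T @ scores)]
mu_star, corners, weights = sparsemap_cg(eta, map_oracle)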

SparseMAP Applications

• Sparse alignment attention (more later) (Niculae, Martins, Blondel, and Cardie, 2018)

• Latent TreeLSTM (Niculae, Martins, and Cardie, 2018)

• As loss: supervised dependency parsing (Niculae, Martins, Blondel, and Cardie 2018; Blondel, Martins, and Niculae 2019b)

26

Latent Dependency Trees

ListOps (Nangia and Bowman, 2018)

Arity tagging with latent GCN (Corro and Titov, 2019; Kipf and Welling, 2017)

(max 2 9 (min 4 7 ) 0 )  →  arity tags: 4 - - 2 - - - - -

27

[Figure: ListOps validation F1 vs. epoch (0–100); curves for Gold tree, Latent tree, and Left‐to‐right.]

28

What if MAP is not available?

Multiple, Overlapping Factors

Maximization in factor graphs: NP‐hard, even when each factor is tractable.

⋆ dog on wheels

[Factor graph: the arc variables (⋆→·, dog→·, on→·, wheels→·) connected to one TREE factor and four BUDGET factors.]

29

Optimization as Consensus-Seeking

[Factor graph: factors fa and fb over overlapping variable blocks μ[1:3], μ[4:6], μ[7:9]; μa and μb share the middle block.]

Agreement on the overlap: μa,[4:6] = μb,[4:6] = μ[4:6]

max_{μ, μf}  ∑_{f∈F} ηf⊤μf   s.t.  Cf μ = μf,  μf ∈ Mf  for f ∈ F

LP relaxation (Wainwright and Jordan, 2008) over the local polytope:  L := {μ : Cf μ ∈ Mf, f ∈ F} ⊇ M

LP‐SparseMAP adds the quadratic term:

max_{μ, μf}  ∑_{f∈F} ηf⊤μf − ½∥μ∥²   s.t.  Cf μ = μf,  μf ∈ Mf  for f ∈ F

(Niculae and Martins, 2020)

30

Algorithms for LP-SparseMAP (Niculae and Martins, 2020)

Forward pass

argmax_{Cf μ = μf}  ∑_{f∈F} ηf⊤μf − ½∥μ∥²  =  argmax_{Cf μ = μf}  ∑_{f∈F} ( ηf⊤μf − ½∥Df μf∥² )

• Separable objective, agreement constraints → ADMM in consensus form

• SparseMAP subproblem for each f

Backward pass

• Sparse fixed‐point iteration

• Combines the SparseMAP Jacobians of each factor

31

Differentiable Sparse Structured Prediction (Niculae and Martins, 2020)

[Factor graph: a TREE factor plus BUDGET factors over the arc variables.]

Factor graphs as a hidden‐layer DSL!

If |F| = 1, recovers SparseMAP.

Modular library. Built‐in specialized factors:
• OR, XOR, AND
• OR‐with‐output
• Budget, Knapsack
• Pairwise

New factors only require MAP.

fg = FactorGraph()
var = [[fg.variable() for j in range(n)] for i in range(n)]  # handwave: one variable per candidate arc i -> j

fg.add(Tree(var))

for i in range(n):
    fg.add(Budget(var[i], budget=5))

μ = fg.lp_sparsemap(η)

class Factor:
    def map(self, ηf):  # abstract, private
        raise NotImplementedError

    def sparsemap(self, ηf):
        ...  # active set algo, uses self.map

    def backward(self, dμf):
        ...  # analytic, uses active set result

class Budget(Factor):
    def sparsemap(self, ηf):
        ...  # specialized

    def backward(self, dμf):
        ...  # specialized

class Tree(Factor):
    def map(self, η):
        ...  # Chu-Liu/Edmonds algo

32

[Figure: ListOps validation F1 vs. epoch (0–100); curves for Gold tree, Latent tree, Latent w/ Budget(5), and Left‐to‐right.]

33

Structured Attention for Alignments

NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.

input (P, H)  →  attention between premise words (A, gentleman, overlooking, ..., situation) and hypothesis words (A, police, officer, ..., closely)  →  output: entails / contradicts / neutral

(Model: decomposable attention (Parikh et al., 2016))

(Proposed model: global structured alignment.)

34

Structured Alignment Models

matching: SparseMAP w/ Kuhn‐Munkres (Kuhn, 1955)

LP‐matching: LP‐SparseMAP w/ XORs (equivalent; different solver)

LP‐sequence: additional score for contiguous alignments (i, j) − (i + 1, j ± 1)

35
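As a concrete example of "just add MAP" (my sketch, not the paper's code): the MAP oracle for the matching factor is a linear assignment problem, solvable with the Kuhn‐Munkres algorithm, e.g. via SciPy.

import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_map(scores):
    """scores[i, j]: score of aligning premise word i with hypothesis word j.
    Returns the 0/1 indicator of the highest-scoring one-to-one matching."""
    rows, cols = linear_sum_assignment(-scores)   # minimizing negated scores = maximizing scores
    a = np.zeros_like(scores)
    a[rows, cols] = 1.0
    return a

print(matching_map(np.random.randn(4, 4)))

Plugged into a SparseMAP solver (such as the conditional‐gradient sketch after slide 25, with the matrix flattened to a vector), this yields sparse matching attention; the LP variants instead decompose the constraint into XOR factors.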

MultiNLI (Williams et al., 2017)

[Figure: accuracy bars for softmax, matching, LP‐matching, and LP‐sequence; axis 66%–72%.]

36

[Figures: learned alignments between the premise "a gentleman overlooking a neighborhood situation ." and the hypothesis "a police officer watches a situation closely ."]

37–38

Conclusions

Differentiable & sparse structured inference

Generic, extensible, efficient algorithms

Interpretable structured attention

Future work

Structure beyond NLP

Weak & semi‐supervision

Generative latent structure models

vlad@vene.ro · github.com/deep-spin/lp-sparsemap · https://vene.ro · @vnfrombucharest

Extra slides

Acknowledgements

This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.

Some icons by Dave Gandy and Freepik via flaticon.com.

40

Sparsemax

sparsemax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥²₂ = argmin_{p ∈ △} ∥p − θ∥²₂

Computation:

p⋆ = [θ − τ1]₊ ;  θi > θj ⇒ pi ≥ pj ;  O(d) via partial sort (Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass:

J_sparsemax = diag(s) − (1/|S|) s s⊤,  where S = {j : p⋆j > 0},  sj = ⟦j ∈ S⟧    (Martins and Astudillo, 2016)

argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)

41
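A minimal sketch (my own) of the backward pass above: applying J_sparsemax = diag(s) − (1/|S|) s s⊤ to an incoming gradient, given an already‐computed p⋆.

import numpy as np

def sparsemax_jvp(p_star, v):
    s = (p_star > 0).astype(float)        # s_j = 1 if j is in the support S
    return s * v - (s @ v / s.sum()) * s  # diag(s) v - (1/|S|) s s^T v

p_star = np.array([0.65, 0.35, 0.0])      # e.g., sparsemax([1.6, 1.3, 0.1])
v = np.array([1.0, -2.0, 3.0])            # incoming gradient dL/dp
print(sparsemax_jvp(p_star, v))           # gradient dL/dtheta; zero outside the support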

Fusedmax

fusedmax(θ) = argmax_{p ∈ △} p⊤θ − ½∥p∥²₂ − ∑_{2≤j≤d} |pj − pj−1|
            = argmin_{p ∈ △} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|

prox_fused(θ) = argmin_{p ∈ Rᵈ} ∥p − θ∥²₂ + ∑_{2≤j≤d} |pj − pj−1|

Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))    (Niculae and Blondel, 2017)

“Fused Lasso”, a.k.a. 1‐d Total Variation (Tibshirani et al., 2005)

42

Danskin’s Theorem (Danskin, 1966; Prop. B.25 in Bertsekas, 1999)

Let φ : Rᵈ × Z → R, Z ⊂ Rᵈ compact. Then

∂ max_{z∈Z} φ(x, z) = conv{∇x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z)}.

Example: maximum of a vector

∂ max_{j∈[d]} θj = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{∇θ φ(p⋆, θ)} = conv{p⋆}

[Figure: for θ = [t, 0], the plot of maxⱼ θⱼ as a function of t, and of the set {g1 | g ∈ ∂ maxⱼ θⱼ}, over t ∈ [−1, +1].]

43
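A small numeric check (mine) of the example above: away from ties, the max is differentiable and its gradient is exactly the one‐hot p⋆.

import numpy as np

def max_val(theta):
    return theta.max()

theta = np.array([0.3, 0.0])                 # t = 0.3 > 0, unique argmax
p_star = np.eye(2)[np.argmax(theta)]         # [1, 0]

eps = 1e-6
grad = np.array([(max_val(theta + eps * e) - max_val(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
print(grad, p_star)                          # both [1, 0]; at t = 0 only a subgradient set exists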

Dynamically inferring the computation graph

So far: a structured hidden layer E_H[a_H]

Network must handle “soft” combinations of structures. Fine for attention, but can be limiting.

Dependency TreeLSTM

The bears eat the pretty ones

(Tai et al., 2015)

45

Latent Dependency TreeLSTM

input x  →  h ∈ H  →  output y

p(y|x) = ∑_{h∈H} p(y | h, x) p(h | x)

The bears eat the pretty ones

(Niculae, Martins, and Cardie, 2018)

46

Structured Latent Variable Models

p(y | x) = ∑_{h ∈ H} pφ(y | h, x) pπ(h | x)

    pφ(y | h, x): e.g., a TreeLSTM defined by h
    pπ(h | x): a parsing model, using some scoreπ(h; x)
    the sum is over all possible trees

How to define pπ? And how to compute ∂p(y | x)/∂π, a sum over all h ∈ H (exponentially large!)?

idea 1    pπ(h | x) = 1 if h = h⋆ else 0          argmax
idea 2    pπ(h | x) ∝ exp(scoreπ(h; x))           softmax
idea 3    SparseMAP

47
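A minimal sketch (my illustration, with hypothetical helper names): with SparseMAP, pπ has small support, so the exponentially large sum collapses to a few terms, and only those structures' computation graphs (e.g., TreeLSTMs) need to be built and run.

def marginal_likelihood(x, y, sparse_posterior, p_phi):
    """sparse_posterior(x) -> list of (structure h, weight p_pi(h|x)) with nonzero weight;
    p_phi(y, h, x) -> likelihood of y under the network built from structure h."""
    return sum(w * p_phi(y, h, x) for h, w in sparse_posterior(x) if w > 0.0)

# Toy usage with two surviving structures:
posterior = lambda x: [("tree_a", 0.7), ("tree_b", 0.3)]
p_phi = lambda y, h, x: {"tree_a": 0.9, "tree_b": 0.4}[h]
print(marginal_likelihood("x", "y", posterior, p_phi))   # 0.7*0.9 + 0.3*0.4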

SparseMAP

[Figure: the expected structure for a sentence decomposes as .7 × (tree 1) + .3 × (tree 2) + 0 × (another tree) + ...]

p(y | x) = .7 pφ(y | tree 1) + .3 pφ(y | tree 2)

48

Sentiment classification (SST)
[Figure: accuracy (binary), 80%–85% axis; bars for LTR, Flat, CoreNLP, Latent.]

Natural Language Inference (SNLI)
[Figure: accuracy (3‐class), 80.6%–82% axis; bars for LTR, Flat, CoreNLP, Latent.]
Sentence pair classification (P, H): p(y | P, H) = ∑_{hP ∈ H(P)} ∑_{hH ∈ H(H)} pφ(y | hP, hH) pπ(hP | P) pπ(hH | H)

Reverse dictionary lookup (definitions and concepts)
[Figure: accuracy@10, 30%–38% axis; bars for LTR, Flat, Latent.]
Given a word description, predict the word embedding (Hill et al., 2016); instead of p(y | x), we model E_{pπ}[g(x)] = ∑_{h ∈ H} g(x; h) pπ(h | x)

Baselines (⋆ The bears eat the pretty ones):
Left‐to‐right: regular LSTM
Flat: bag‐of‐words–like
CoreNLP: off‐line parser

Syntax vs. Composition Order

⋆ lovely and poignant .    p = 22.6%
⋆ lovely and poignant .    CoreNLP parse, p = 21.4%
· · ·

⋆ a deep and meaningful film .    p = 15.33%
⋆ a deep and meaningful film .    p = 15.27%
· · ·
⋆ a deep and meaningful film .    CoreNLP parse, p = 0%

[Figures: the trees receiving these posterior probabilities, with edge weights 1.0.]

50–51

Structured Output Prediction

SparseMAP loss:       L_A(η, μ̄) = max_{μ∈M} ( η⊤μ − ½∥μ∥² ) − η⊤μ̄ + ½∥μ̄∥²

cost‐SparseMAP loss:  Lρ_A(η, μ̄) = max_{μ∈M} ( η⊤μ − ½∥μ∥² + ρ(μ, μ̄) ) − η⊤μ̄ + ½∥μ̄∥²

Instance of a structured Fenchel‐Young loss, like CRF, SVM, etc. (Blondel, Martins, and Niculae, 2019b)

52
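Since the first term is a maximum of functions affine in η over the compact set M, Danskin's theorem (slide 43) gives the gradient of the SparseMAP loss directly; a short sketch in the notation above, with μ̄ the gold structure's marginals:

∇_η L_A(η, μ̄) = ∇_η max_{μ∈M} ( η⊤μ − ½∥μ∥² ) − μ̄ = μ⋆(η) − μ̄,

where μ⋆(η) is the SparseMAP solution (unique, since the objective is strongly concave in μ). A gradient step therefore pulls the predicted sparse marginals toward the gold ones; the cost‐augmented loss behaves the same way, except that μ⋆ solves the max with the extra ρ(μ, μ̄) term.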

References I

Amos, Brandon and J. Zico Kolter (2017). “OptNet: Differentiable optimization as a layer in neural networks”. In: Proc. of ICML.

Andre‐Obrecht, Regine (1988). “A new statistical approach for the automatic segmentation of continuous speech signals”. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 36.1, pp. 29–40.

Bertsekas, Dimitri P (1999). Nonlinear Programming. Athena Scientific Belmont.

Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019a). “Learning classifiers with Fenchel‐Young losses: Generalized entropies, margins, and algorithms”. In: Proc. of AISTATS.

— (2019b). “Learning with Fenchel‐Young Losses”. In: preprint arXiv:1901.02324.

Brucker, Peter (1984). “An O(n) algorithm for quadratic knapsack problems”. In: Operations Research Letters 3.3, pp. 163–166.

Condat, Laurent (2016). “Fast projection onto the simplex and the ℓ1 ball”. In: Mathematical Programming 158.1‐2, pp. 575–585.

Correia, Gonçalo M., Vlad Niculae, and André FT Martins (2019). “Adaptively Sparse Transformers”. In: Proc. EMNLP.

Corro, Caio and Ivan Titov (2019). “Learning latent trees with stochastic perturbations and differentiable dynamic programming”. In: Proc. of ACL.

Danskin, John M (1966). “The theory of max‐min, with applications”. In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.

53

References II

Dantzig, George B, Alex Orden, and Philip Wolfe (1955). “The generalized simplex method for minimizing a linear form under linear inequality restraints”. In: Pacific Journal of Mathematics 5.2, pp. 183–195.

Frank, Marguerite and Philip Wolfe (1956). “An algorithm for quadratic programming”. In: Nav. Res. Log. 3.1‐2, pp. 95–110.

Gould, Stephen et al. (2016). “On differentiating parameterized argmin and argmax problems with application to bi‐level optimization”. In: preprint arXiv:1607.05447.

Grünwald, Peter D and A Philip Dawid (2004). “Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory”. In: Annals of Statistics, pp. 1367–1433.

Held, Michael, Philip Wolfe, and Harlan P Crowder (1974). “Validation of subgradient optimization”. In: Mathematical Programming 6.1, pp. 62–88.

Hill, Felix et al. (2016). “Learning to understand phrases by embedding the dictionary”. In: TACL 4.1, pp. 17–30.

Kim, Yoon et al. (2017). “Structured attention networks”. In: Proc. of ICLR.

Kipf, Thomas, Elise van der Pol, and Max Welling (2020). “Contrastive Learning of Structured World Models”. In: Proc. of ICLR.

Kipf, Thomas and Max Welling (2017). “Semi‐supervised classification with graph convolutional networks”. In: Proc. of ICLR.

54

References III

Koo, Terry et al. (2007). “Structured prediction models via the matrix‐tree theorem”. In: Proc. of EMNLP.

Kuhn, Harold W (1955). “The Hungarian method for the assignment problem”. In: Nav. Res. Log. 2.1‐2, pp. 83–97.

Lacoste‐Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank‐Wolfe optimization variants”. In: Proc. of NeurIPS.

Liu, Yang and Mirella Lapata (2018). “Learning structured text representations”. In: TACL 6, pp. 63–75.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). “Fully convolutional networks for semantic segmentation”. In: Proc. of CVPR.

Martins, André FT and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi‐label classification”. In: Proc. of ICML.

Martins, André FT, Mário AT Figueiredo, et al. (2015). “AD3: Alternating directions dual decomposition for MAP inference in graphical models”. In: JMLR 16.1, pp. 495–545.

McDonald, Ryan T and Giorgio Satta (2007). “On the complexity of non‐projective data‐driven dependency parsing”. In: Proc. of ICPT.

Nangia, Nikita and Samuel Bowman (2018). “ListOps: A diagnostic dataset for latent tree learning”. In: Proc. of NAACL SRW.

55

References IV

Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.

Niculae, Vlad and André FT Martins (2020). “LP‐SparseMAP: Differentiable relaxed optimization for sparse structured prediction”. In: preprint arXiv:2001.04437.

Niculae, Vlad, André FT Martins, Mathieu Blondel, et al. (2018). “SparseMAP: Differentiable sparse structured inference”. In: Proc. of ICML.

Niculae, Vlad, André FT Martins, and Claire Cardie (2018). “Towards dynamic computation graphs via sparse latent structure”. In: Proc. of EMNLP.

Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.

Parikh, Ankur et al. (2016). “A decomposable attention model for natural language inference”. In: Proc. of EMNLP.

Peters, Ben, Vlad Niculae, and André FT Martins (2019). “Sparse sequence‐to‐sequence models”. In: Proc. ACL.

Rabiner, Lawrence R. (1989). “A tutorial on Hidden Markov Models and selected applications in speech recognition”. In: P. IEEE 77.2, pp. 257–286.

Smith, David A and Noah A Smith (2007). “Probabilistic models of nonprojective dependency trees”. In: Proc. of EMNLP.

56

References V

Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). “Improved semantic representations from tree‐structured Long Short‐Term Memory networks”. In: Proc. of ACL‐IJCNLP.

Taskar, Ben (2004). “Learning structured prediction models: A large margin approach”. PhD thesis. Stanford University.

Tibshirani, Robert et al. (2005). “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.

Tsallis, Constantino (1988). “Possible generalization of Boltzmann‐Gibbs statistics”. In: Journal of Statistical Physics 52, pp. 479–487.

Valiant, Leslie G (1979). “The complexity of computing the permanent”. In: Theor. Comput. Sci. 8.2, pp. 189–201.

Wainwright, Martin J and Michael I Jordan (2008). Graphical models, exponential families, and variational inference. Vol. 1. 1–2. Now Publishers, Inc., pp. 1–305.

Williams, Adina, Nikita Nangia, and Samuel R Bowman (2017). “A broad‐coverage challenge corpus for sentence understanding through inference”. In: preprint arXiv:1704.05426.

Wolfe, Philip (1976). “Finding the nearest point in a polytope”. In:Mathematical Programming 11.1, pp. 128–149.

57
