Learning Structural SVMs with Latent Variables
Chun-Nam Yu
Dept. of Computer Science, Cornell University
October 8-9, IBM SMiLe Workshop
C.-N. Yu (Cornell) Latent Structural SVMs Oct 8-9, IBM SMiLe Workshop 1 / 21
Structured Output Prediction

Traditional classification and regression
Structured output prediction
Introduction to Structural SVMs

Structural SVM (margin rescaling) [Tsochantaridis et al. '04]

$$\min_{\vec{w},\,\vec{\xi}} \;\; \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{n}\xi_i$$

$$\text{s.t. for } 1 \le i \le n, \text{ for all output structures } y \in \mathcal{Y}: \quad \vec{w}\cdot\Phi(x_i, y_i) - \vec{w}\cdot\Phi(x_i, y) \;\ge\; \Delta(y_i, y) - \xi_i$$

Intuition: the score of the correct parse tree, $\vec{w}\cdot\Phi(x_i, y_i)$, must beat the score of every wrong parse tree $\vec{w}\cdot\Phi(x_i, y)$ by a margin.

The loss function $\Delta$ controls the penalty of predicting $y$ instead of $y_i$.
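As a concrete illustration of these constraints (not from the talk: the class-conditional feature map and the 0/1 loss below are illustrative assumptions), the smallest slack $\xi_i$ satisfying all margin-rescaling constraints for one example is the structural hinge loss:

```python
import numpy as np

def joint_feature(x, y, n_classes):
    # Class-conditional copy of x: multiclass prediction viewed as the
    # simplest structured output (illustrative choice of Phi).
    phi = np.zeros(n_classes * len(x))
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi

def slack(w, x, y_true, n_classes, delta):
    # Smallest xi with w.Phi(x,y_true) - w.Phi(x,y) >= delta(y_true,y) - xi
    # for every output y, i.e. the structural hinge loss of this example.
    correct = w @ joint_feature(x, y_true, n_classes)
    return max(delta(y_true, y) - correct + w @ joint_feature(x, y, n_classes)
               for y in range(n_classes))
```

With the 0/1 loss, a weight vector that separates the example with margin at least 1 gives slack 0; the slack grows as competing outputs score closer to the correct one.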
Solving Margin-based Training Problems with the Cutting-Plane Algorithm

Exponentially many constraints, but solvable in polynomial time:
- using the cutting-plane algorithm to speed up training of structural SVMs [Joachims, Finley & Yu, MLJ '09]
- using approximate cutting-plane models to build faster and sparser kernel SVMs [Yu & Joachims, KDD '08], [Joachims & Yu, ECML '09; Best Machine Learning Paper]
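The working-set idea can be sketched as follows. This is a minimal illustration, not the MLJ '09 implementation: the separation oracle finds the most-violated constraint per example, and a plain subgradient loop stands in for the QP solver over the working set.

```python
import numpy as np

def most_violated(w, x, y_true, labels, phi, delta):
    # Separation oracle: the output maximizing delta(y_true,y) + w.Phi(x,y).
    return max(labels, key=lambda y: delta(y_true, y) + w @ phi(x, y))

def cutting_plane_train(data, labels, phi, delta, C=1.0, eps=1e-3,
                        rounds=50, inner_steps=200, lr=0.05):
    w = np.zeros_like(phi(*data[0]))
    working = set()          # working set of cuts: (example index, wrong output)
    for _ in range(rounds):
        added = 0
        for i, (x, y) in enumerate(data):
            ybar = most_violated(w, x, y, labels, phi, delta)
            violation = delta(y, ybar) - w @ (phi(x, y) - phi(x, ybar))
            if violation > eps and (i, ybar) not in working:
                working.add((i, ybar))   # add the new cutting plane
                added += 1
        if added == 0:       # no constraint violated by more than eps: done
            break
        for _ in range(inner_steps):     # re-solve over the working set only
            grad = w.copy()              # gradient of (1/2)||w||^2
            for i, (x, y) in enumerate(data):
                cuts = [yb for j, yb in working if j == i]
                if not cuts:
                    continue
                # most violated cut currently in the working set for example i
                yb = max(cuts, key=lambda v: delta(y, v) - w @ (phi(x, y) - phi(x, v)))
                if delta(y, yb) - w @ (phi(x, y) - phi(x, yb)) > 0:
                    grad -= C * (phi(x, y) - phi(x, yb))
            w -= lr * grad
    return w
```

The key property (from the MLJ '09 analysis) is that only a polynomial number of cuts is ever needed, even though the full constraint set is exponentially large.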
Incomplete Label Information and Latent Variables

Discriminative motif finding
Noun Phrase Coreference
Latent Structural Support Vector Machines

Latent Structural SVM [Yu & Joachims, ICML '09]

$$\min_{\vec{w},\,\vec{\xi}} \;\; \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{n}\xi_i$$

$$\text{s.t. for } 1 \le i \le n, \text{ for all outputs } y \in \mathcal{Y}: \quad \max_{h\in\mathcal{H}} \vec{w}\cdot\Phi(x_i, y_i, h) - \max_{h\in\mathcal{H}} \vec{w}\cdot\Phi(x_i, y, h) \;\ge\; \Delta(y_i, y, h) - \xi_i$$

Intuition: the best completion (over latent variables $h$) of the correct output $y_i$ must score higher than the best completion of any wrong output $y$, by a margin.
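The corresponding prediction rule maximizes jointly over outputs and latent variables, then discards $h$. A minimal sketch (brute-force enumeration over small finite $\mathcal{Y}$ and $\mathcal{H}$, and the toy feature map in the usage below, are illustrative assumptions):

```python
import numpy as np
from itertools import product

def predict(w, x, outputs, latents, phi):
    # Prediction rule of a latent structural SVM:
    # (y*, h*) = argmax over (y, h) of w.Phi(x, y, h); only y* is reported.
    y_star, h_star = max(product(outputs, latents),
                         key=lambda yh: w @ phi(x, yh[0], yh[1]))
    return y_star
```

In real applications the joint argmax is computed by a combinatorial inference routine (e.g. dynamic programming) rather than explicit enumeration.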
Solving the Non-Convex Optimization

Concave-Convex Procedure [Yuille & Rangarajan '03]
1. Decompose the objective into convex and concave parts
2. Upper-bound the concave part with a hyperplane
3. Minimize the resulting convex sum; iterate until convergence

Recent work employing the CCCP algorithm: [Collobert et al. '06, Smola et al. '05, Chapelle et al. '08]
Solving the Non-Convex Optimization

Concave-Convex Procedure (CCCP)
(1) Decompose the objective into convex and concave parts:

$$\underbrace{\left[\frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{n}\max_{(y,h)\in\mathcal{Y}\times\mathcal{H}}\left[\vec{w}\cdot\Phi(x_i, y, h) + \Delta(y_i, y, h)\right]\right]}_{\text{convex}} \;-\; \underbrace{\left[C\sum_{i=1}^{n}\max_{h\in\mathcal{H}}\vec{w}\cdot\Phi(x_i, y_i, h)\right]}_{\text{concave}}$$
Solving the Non-Convex Optimization

Concave-Convex Procedure (CCCP)
(2) Upper-bound the concave part with a hyperplane at $\vec{w}_t$:

$$\forall\vec{w},\quad -\underbrace{\left[C\sum_{i=1}^{n}\max_{h\in\mathcal{H}}\vec{w}\cdot\Phi(x_i, y_i, h)\right]}_{\text{concave}} \;\le\; -\underbrace{\left[C\sum_{i=1}^{n}\vec{w}\cdot\Phi(x_i, y_i, h_i^*)\right]}_{\text{linear}}$$

where $h_i^* = \operatorname{argmax}_{h\in\mathcal{H}} \vec{w}_t\cdot\Phi(x_i, y_i, h)$.
Solving the Non-Convex Optimization

Concave-Convex Procedure (CCCP)
(3) Minimize the resulting convex sum to get $\vec{w}_{t+1}$:

$$\vec{w}_{t+1} = \operatorname*{argmin}_{\vec{w}} \underbrace{\left[\frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{n}\max_{(y,h)\in\mathcal{Y}\times\mathcal{H}}\left[\vec{w}\cdot\Phi(x_i, y, h) + \Delta(y_i, y, h)\right]\right]}_{\text{convex}} \;-\; \underbrace{\left[C\sum_{i=1}^{n}\vec{w}\cdot\Phi(x_i, y_i, h_i^*)\right]}_{\text{linear}}$$
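The three steps above can be sketched as a loop. This is an illustrative sketch, not the ICML '09 implementation: the loss is simplified to depend on $(y_i, y)$ only, and a subgradient loop stands in for a full structural-SVM solver in step (3).

```python
import numpy as np
from itertools import product

def cccp_train(data, outputs, latents, phi, delta,
               iters=10, inner_steps=400, lr=0.05, C=1.0):
    w = np.zeros_like(phi(data[0][0], data[0][1], latents[0]))
    for _ in range(iters):
        # Step (2): impute h_i* = argmax_h w_t.Phi(x_i, y_i, h),
        # which linearizes (upper-bounds) the concave part at w_t.
        h_star = [max(latents, key=lambda h: w @ phi(x, y, h)) for x, y in data]
        # Step (3): minimize the convex sum with the h_i* held fixed.
        for _ in range(inner_steps):
            grad = w.copy()              # gradient of (1/2)||w||^2
            for (x, y), hs in zip(data, h_star):
                # Loss-augmented inference over both y and h.
                yb, hb = max(product(outputs, latents),
                             key=lambda yh: w @ phi(x, yh[0], yh[1]) + delta(y, yh[0]))
                grad += C * (phi(x, yb, hb) - phi(x, y, hs))
            w -= lr * grad
    return w
```

Each outer iteration re-imputes the latent variables with the current weights and then solves a standard (convex) structural-SVM problem, so any existing structural-SVM solver can be reused for the inner step.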
Analogy to Expectation-Maximization

E-step: equivalent to computing the upper-bounding hyperplane
M-step: equivalent to minimizing the convex sum

Point estimate for latent variables; no normalization with a partition function required.
Discriminative probabilistic models with latent variables: [Gunawardana et al. '05], [Wang et al. '06], [Petrov & Klein '07]
Noun Phrase Coreference

Input x: noun phrases with edge features
Label y: clusters of noun phrases
Latent variable h: 'strong' links as trees
Task: cluster the noun phrases using single-link agglomerative clustering
Inference: minimum spanning tree

[from Cardie & Wagstaff '99]
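The tree inference step can be sketched with Kruskal's algorithm run on edge scores in descending order (the highest-scoring tree of links, equivalent to a minimum spanning tree on negated scores); the `scored_edges` triples in the usage are made-up numbers:

```python
def best_spanning_tree(n_phrases, scored_edges):
    # Kruskal's algorithm over edges sorted by descending score: the latent
    # variable h is the highest-scoring spanning tree of 'strong' links.
    # scored_edges: list of (score, u, v) over noun-phrase indices.
    parent = list(range(n_phrases))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    tree = []
    for score, u, v in sorted(scored_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                       # keep the edge if it joins two components
            parent[ru] = rv
            tree.append((u, v))
    return tree
```

The resulting tree plays the role of $h$; cutting its weak links recovers the clusters $y$ produced by single-link agglomerative clustering.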
Noun Phrase Coreference: Results

Test on MUC 6 data, using the same features as in [Ng & Cardie '02]
Initialize spanning trees by chronological order

10-fold CV results:

| Algorithm | MITRE loss |
| --- | --- |
| SVMcluster [Finley & Joachims '05] | 41.3 |
| Latent Structural SVM | 35.6 |
Discriminative Motif Finding

Input x: DNA sequences containing ARS from S. cerevisiae and S. kluyveri
Label y: whether the sequence replicates in S. cerevisiae
Latent variable h: position of the motif
Task: find the predictive motif
Inference: enumerate all positions h
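Since $h$ ranges over window positions, inference is a brute-force enumeration; the position-weight-matrix style scoring below is an illustrative assumption, not the feature map used in the talk:

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def best_motif_position(w, seq, width):
    # Latent-variable inference by enumeration: score every window position h
    # with a position-specific weight matrix w (shape: width x 4, columns
    # indexed by A, C, G, T) and return the argmax position.
    def window_score(h):
        return sum(w[j, BASES[seq[h + j]]] for j in range(width))
    return max(range(len(seq) - width + 1), key=window_score)
```

For a sequence of length L and motif width w this costs O(L·w) per example, which is why exact enumeration is feasible here.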
Discriminative Motif Finding: Results

Data: 197 yeast DNA sequences from S. cerevisiae and S. kluyveri; ~6000 intergenic sequences for background estimation
10-fold CV, 10 random restarts for each parameter setting

| Algorithm | Error rate |
| --- | --- |
| Gibbs Sampler (w=11) | 37.9% |
| Gibbs Sampler (w=17) | 35.06% |
| Latent Structural SVM (w=11) | 11.09% |
| Latent Structural SVM (w=17) | 12.00% |
![Page 49: Learning Structural SVMs with Latent VariablesLearning Structural SVMs with Latent Variables Chun-Nam Yu Dept. of Computer Science, Cornell University October 8-9, IBM SMiLe Workshop](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec4dd10e891f3349e4e6b2d/html5/thumbnails/49.jpg)
Conclusions and Future Directions
A new formulation of Latent Variable Structural SVM with an efficient solution algorithm

A modular algorithm that achieves very good accuracy on two example structured prediction tasks

Potential extensions to semi-supervised settings

Also looking at situations in structured output learning where unlabeled data in the output domain Y are plentiful
![Page 54: Learning Structural SVMs with Latent VariablesLearning Structural SVMs with Latent Variables Chun-Nam Yu Dept. of Computer Science, Cornell University October 8-9, IBM SMiLe Workshop](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec4dd10e891f3349e4e6b2d/html5/thumbnails/54.jpg)
Discriminative Motif Finding - Formulation

Feature vector Φ: position-specific weight matrix plus parameters for a Markov background model

$$\Phi(x, y, h) \;=\; \underbrace{\sum_{i=1}^{h} \phi_{BG}(x_i)}_{\text{background}} \;+\; \underbrace{\sum_{j=1}^{l} \phi^{(j)}_{PSM}(x_{h+j})}_{\text{motif}} \;+\; \underbrace{\sum_{i=h+l+1}^{n} \phi_{BG}(x_i)}_{\text{background}}$$

[from Wasserman 2004]

Loss function ∆: zero-one loss

Inference: enumeration, as y is binary and h is linear in sequence length
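Since y is binary and h ranges over motif start positions, inference can be carried out by direct enumeration. A minimal Python sketch under that setup; `w_bg` (per-nucleotide background weights) and `w_psm` (position-specific motif weights) are illustrative stand-ins, not names from the talk:

```python
# Inference by enumeration for the motif model: y = 0 means the whole
# sequence is background; y = 1 places a length-l motif at latent
# position h, with background on either side.

def score_background(w_bg, x):
    # Total background score of a (sub)sequence.
    return sum(w_bg[c] for c in x)

def infer(w_bg, w_psm, x, l):
    """Return (y, h) maximizing the joint score w . Phi(x, y, h)."""
    n = len(x)
    # y = 0: no motif, entire sequence explained by the background model.
    best_score, best = score_background(w_bg, x), (0, None)
    # y = 1: enumerate all motif placements h (linear in sequence length).
    for h in range(n - l + 1):
        s = score_background(w_bg, x[:h])                 # before motif
        s += sum(w_psm[j][x[h + j]] for j in range(l))    # motif positions
        s += score_background(w_bg, x[h + l:])            # after motif
        if s > best_score:
            best_score, best = s, (1, h)
    return best
```

With a toy model that rewards "GG" at any position, `infer` returns y = 1 with h at the motif start, and y = 0 when no placement beats the background score.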
![Page 57: Learning Structural SVMs with Latent VariablesLearning Structural SVMs with Latent Variables Chun-Nam Yu Dept. of Computer Science, Cornell University October 8-9, IBM SMiLe Workshop](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec4dd10e891f3349e4e6b2d/html5/thumbnails/57.jpg)
Noun Phrase Coreference - Formulation

Feature vector Φ: sum of tree edge features:

$$\Phi(x, y, h) = \sum_{(i,j) \in h} x_{ij}$$

Loss function ∆:

$$\Delta(y, \hat{y}, \hat{h}) = \underbrace{n(y)}_{\#\text{nodes}} - \underbrace{k(y)}_{\#\text{components}} + \underbrace{\sum_{(i,j) \in \hat{h}} \ell(y, (i,j))}_{+1/-1}$$

Inference: any maximum spanning tree algorithm
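Since inference here is just a maximum spanning tree over candidate edges, it can be sketched with Kruskal's algorithm and a small union-find. Edge scores below stand in for the learned edge weights w · x_ij; all names are illustrative:

```python
# Maximum spanning tree via Kruskal's algorithm: greedily add the
# highest-scoring edge that does not create a cycle, tracked with a
# union-find (disjoint-set) structure.

def max_spanning_tree(n, edges):
    """edges: list of (score, i, j); returns the set of tree edges (i, j)."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    tree = set()
    # Sort by descending score -> maximum (not minimum) spanning tree.
    for score, i, j in sorted(edges, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:             # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.add((i, j))
    return tree
```

Any other maximum spanning tree routine (e.g. Prim's algorithm) works equally well, as the slide notes.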
![Page 60: Learning Structural SVMs with Latent VariablesLearning Structural SVMs with Latent Variables Chun-Nam Yu Dept. of Computer Science, Cornell University October 8-9, IBM SMiLe Workshop](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec4dd10e891f3349e4e6b2d/html5/thumbnails/60.jpg)
Optimizing Precision@k

Input x: a query with an associated collection of documents

Label y: relevance judgments for each document

Latent variable h: the top k relevant documents
Query q: ICML 2009
![Page 61: Learning Structural SVMs with Latent VariablesLearning Structural SVMs with Latent Variables Chun-Nam Yu Dept. of Computer Science, Cornell University October 8-9, IBM SMiLe Workshop](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec4dd10e891f3349e4e6b2d/html5/thumbnails/61.jpg)
Optimizing Precision@k - Formulation
Feature vector Φ: sum of features from the top k documents

$$\Phi(x, y, h) = \sum_{j=1}^{k} x_{h_j}$$

Loss function ∆: one minus precision@k

$$\Delta(y, \hat{y}, \hat{h}) = 1 - \frac{1}{k} \sum_{j=1}^{k} [y_{\hat{h}_j} = 1]$$

Depends only on the top k documents selected by h

Inference: sorting
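Because the loss depends only on the k documents selected by h, inference reduces to sorting documents by score. A minimal Python sketch, with `scores` as a hypothetical stand-in for the per-document scores w · x_j:

```python
# Inference by sorting: the best latent h is simply the k
# highest-scoring documents.

def infer_top_k(scores, k):
    """Return indices h of the k highest-scoring documents."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]

def precision_at_k_loss(y, h, k):
    """One minus precision@k for relevance labels y and selection h."""
    return 1.0 - sum(y[j] == 1 for j in h) / k
```

For example, with scores [0.2, 0.9, 0.5, 0.7] and k = 2, inference selects documents 1 and 3; if only document 1 is relevant, the loss is 1 − 1/2 = 0.5.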
![Page 62: Learning Structural SVMs with Latent VariablesLearning Structural SVMs with Latent Variables Chun-Nam Yu Dept. of Computer Science, Cornell University October 8-9, IBM SMiLe Workshop](https://reader034.vdocuments.us/reader034/viewer/2022042219/5ec4dd10e891f3349e4e6b2d/html5/thumbnails/62.jpg)
Optimizing Precision@k - Results

OHSUMED dataset from the LETOR 3.0 benchmark

Initialize h with a weight vector trained on classification accuracy

5-fold CV results: