
Lecture 10: Kernel Methods for Structured Outputs

Pavel Laskov and Blaine Nelson

Cognitive Systems Group, Wilhelm Schickard Institute for Computer Science, Universität Tübingen, Germany

Advanced Topics in Machine Learning, July 10, 2012


What Are Structured Outputs?

Multiclass classification

Parsing


Syntactic alignment


Label sequence learning


Multi-Class Classification: One-vs-the-Rest

Advantages:

Small number of classifiers

Rejection possible

Disadvantages:

Complex classifier boundaries

Unbalanced classification problems
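To make the scheme concrete, here is a minimal Python sketch of one-vs-the-rest training and prediction. The choice of scikit-learn's LinearSVC as the base binary learner and the rejection threshold are illustrative assumptions, not something prescribed by the lecture.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    """Train one binary SVM per class: class m vs. all other classes."""
    return {m: LinearSVC().fit(X, (y == m).astype(int)) for m in classes}

def predict_one_vs_rest(models, X, threshold=0.0):
    """Pick the class whose classifier is most confident."""
    scores = np.column_stack([models[m].decision_function(X) for m in sorted(models)])
    labels = np.array(sorted(models))[scores.argmax(axis=1)]
    # Rejection: if no classifier fires above the threshold, mark as -1 ("don't know").
    labels[scores.max(axis=1) < threshold] = -1
    return labels
```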


Pairwise Multi-Class Classification

Disadvantages:

Large number of classifiers

Rejection not possible

Advantages:

Simple classifier boundaries

Balanced classification problems
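A corresponding sketch for the pairwise scheme, again assuming LinearSVC as the base learner: one classifier per pair of classes, prediction by majority vote (note there is no natural rejection option).

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_pairwise(X, y, classes):
    """Train one binary SVM for every pair of classes: M(M-1)/2 classifiers in total."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
    return models

def predict_pairwise(models, X, classes):
    """Each pairwise classifier votes for one of its two classes; the majority wins."""
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), clf in models.items():
        pred = clf.predict(X)              # 1 -> vote for class a, 0 -> vote for class b
        votes[pred == 1, index[a]] += 1
        votes[pred == 0, index[b]] += 1
    return np.array(classes)[votes.argmax(axis=1)]
```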


Multi-Class Classification: Distributed Encoding

Encode classes with binary words with sufficient separation between them.

Train binary classifiers corresponding to the columns of the code matrix.

Apply all classifiers at the decision stage; decide by majority vote.

If the minimal Hamming distance between code words is d, then ⌊(d−1)/2⌋ errors can be corrected. (A small decoding sketch follows after the lists below.)

            code word
  class   1  2  3  4  5
    1     0  0  0  1  1
    2     0  0  1  0  1
    3     0  1  0  0  1
    4     0  1  1  0  0
    5     1  0  0  1  0
    6     1  1  0  1  1
    7     1  1  1  0  1
    8     1  1  1  1  0

Advantages:

Small number of classifiers

Balanced classification problems

Disadvantages:

Complex classifier boundaries

Rejection not possible
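Here is a minimal sketch of Hamming-distance decoding with the code matrix from the table above; the example inputs are made up.

```python
import numpy as np

# Code matrix from the slide: 8 classes encoded with 5 binary classifiers.
CODES = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])  # row m-1 is the code word of class m

def decode(bits):
    """Assign the class whose code word has the smallest Hamming distance
    to the vector of binary classifier outputs."""
    distances = (CODES != np.asarray(bits)).sum(axis=1)
    return int(distances.argmin()) + 1   # classes are numbered 1..8

print(decode([1, 1, 0, 1, 1]))  # -> 6 (exact match with the code word of class 6)
print(decode([1, 0, 0, 0, 0]))  # -> 5 (code word 10010 is the closest one)
```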


Multi-Class SVM

Consider the decision function of a one-vs-the-rest classifier:

$$f(\mathbf{x}) = \operatorname*{argmax}_{m \in \{1,\dots,M\}} f_m(\mathbf{x})$$

Key idea: build this constraint into the training problem for each example with known label:

$$\min_{\mathbf{w}} \ \frac{1}{2} \sum_{m=1}^{M} (\mathbf{w}_m \cdot \mathbf{w}_m)$$

$$\text{subject to: } (\mathbf{w}_{y_i} \cdot \mathbf{x}_i) + b_{y_i} \geq \max_{m \neq y_i} (\mathbf{w}_m \cdot \mathbf{x}_i) + b_m, \quad \forall i$$


Multi-Class SVM (ctd.)

The training problem can be adjusted to "soft-margin" and transformed into the following form:

$$\min_{\mathbf{w}, \xi} \ \frac{1}{2} \sum_{m=1}^{M} (\mathbf{w}_m \cdot \mathbf{w}_m) + C \sum_{i=1}^{N} \sum_{m \neq y_i} \xi_{im}$$

$$\text{subject to: } (\mathbf{w}_{y_i} \cdot \mathbf{x}_i) + b_{y_i} \geq (\mathbf{w}_m \cdot \mathbf{x}_i) + b_m + 2 - \xi_{im}, \quad \forall i, m \neq y_i$$

$$\xi_{im} \geq 0, \quad \forall i, m \neq y_i$$

Single quadratic problem with 2N(M − 1) linear constraints.
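As an illustration, the soft-margin primal above can be written down directly with a generic convex/QP solver. This sketch uses the cvxpy modelling library (my choice, not the lecture's) and a made-up toy data set; it is not meant to scale, only to show the structure of the single problem.

```python
import numpy as np
import cvxpy as cp

def train_multiclass_svm(X, y, M, C=1.0):
    """Solve the soft-margin multi-class SVM primal from the slide with a generic QP solver."""
    N, d = X.shape
    W = cp.Variable((M, d))
    b = cp.Variable(M)
    xi = cp.Variable((N, M), nonneg=True)

    constraints = []
    for i in range(N):
        for m in range(M):
            if m == y[i]:
                continue
            # The true class must beat class m by a margin of 2, up to slack.
            constraints.append(
                W[y[i]] @ X[i] + b[y[i]] >= W[m] @ X[i] + b[m] + 2 - xi[i, m]
            )

    objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return W.value, b.value

# Toy usage: 3 well-separated classes in 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 10)
W, b = train_multiclass_svm(X, y, M=3)
print((X @ W.T + b).argmax(axis=1))  # predicted classes
```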


Multi-Class SVM (ctd.)

A dual formulation of the multi-class SVM can be obtained by the standard technique of Lagrange multipliers:

$$\max_{\alpha} \ 2 \sum_{i,m} \alpha_{im} - \sum_{i,j,m} \left[ \frac{1}{2} \mathbb{1}_{y_i = y_j} A_i A_j - \alpha_{im} \alpha_{j y_i} + \frac{1}{2} \alpha_{im} \alpha_{jm} \right] (\mathbf{x}_i \cdot \mathbf{x}_j)$$

$$\text{subject to: } \sum_{i=1}^{N} \alpha_{im} = \sum_{i=1}^{N} \mathbb{1}_{y_i = m} A_i, \quad m = 1, \dots, M$$

$$0 \leq \alpha_{im} \leq C, \quad \alpha_{i y_i} = 0$$

where $A_i = \sum_{m=1}^{M} \alpha_{im}$.


Multi-Class SVM: Summary

Generalized notion of the margin: the difference between the classification score of the correct class and the nearest competing classification score.

A single optimization problem accounts for interaction between the various classes.

Kernelization is straightforward.

The size of the optimization problem is multiplied by the number of classes.


Structured Output Learning: Preliminaries

Output values y belong to a discrete space Y (e.g., parse trees).

Goal: given the input {(x_1, y_1), ..., (x_N, y_N)}, find a function f : X → Y which represents the dependence of y on x.

How can we define a function with the range in Y?

Detour: Suppose we could learn some other function F : X × Y → R. Then we can define the learned prediction function as:

$$f = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} F(\mathbf{x}, \mathbf{y})$$
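A minimal sketch of this detour for a toy finite output space; the compatibility function F and the candidate outputs below are made-up placeholders. For genuinely structured spaces the enumeration is replaced by problem-specific search, as discussed later in the lecture.

```python
def predict(F, x, output_space):
    """f(x) = argmax over y in Y of F(x, y), assuming Y is small enough to enumerate."""
    return max(output_space, key=lambda y: F(x, y))

def F(x, y):
    """Toy compatibility score between an input word and a candidate tag."""
    if y == "VERB":
        return 1.0 if x.endswith("ing") else 0.0
    return 0.0 if x.endswith("ing") else 1.0

print(predict(F, "running", ["NOUN", "VERB"]))  # -> "VERB"
```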


Structured Output Learning: Problem Setup

Assume F to be linear in some joint space of inputs and outputs:

F (x, y) = 〈w,Ψ(x, y)〉

For each example, define its margin as the prediction difference to some "runner-up" example:

$$M_i = \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}_i) \rangle - \max_{\mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i} \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}) \rangle$$

Define the learning problem as weight minimization subject to margin constraints:

$$\min_{\mathbf{w}} \ \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to: } M_i \geq 1, \ \forall i$$


Re-inventing the Wheel: Multi-Class SVM

The single-problem multi-class SVM is a special case of structured output learning with the following definitions:

$$\mathbf{w} = [\mathbf{w}_1, \dots, \mathbf{w}_M]$$

$$\Psi(\mathbf{x}, y) = [\mathbb{1}_{y=1}\,\mathbf{x}, \dots, \mathbb{1}_{y=M}\,\mathbf{x}]$$

$$\|\mathbf{w}\|^2 = \sum_{m=1}^{M} (\mathbf{w}_m \cdot \mathbf{w}_m)$$

$$M_i = \langle \mathbf{w}_{y_i}, \mathbf{x}_i \rangle - \max_{m \neq y_i} \langle \mathbf{w}_m, \mathbf{x}_i \rangle$$
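A small numerical check of this correspondence, assuming the block layout of Ψ(x, y) given above: placing x in the block of class y makes the joint score ⟨w, Ψ(x, y)⟩ reduce to the per-class score ⟨w_y, x⟩.

```python
import numpy as np

def joint_feature_map(x, y, M):
    """Psi(x, y): place x in the block belonging to class y, zeros elsewhere."""
    psi = np.zeros(M * x.size)
    psi[y * x.size:(y + 1) * x.size] = x
    return psi

# With w = [w_1, ..., w_M] stacked, <w, Psi(x, y)> equals <w_y, x>.
M, d = 3, 4
rng = np.random.default_rng(1)
w_blocks = rng.normal(size=(M, d))   # w_1, ..., w_M
w = w_blocks.ravel()
x = rng.normal(size=d)

for y in range(M):
    assert np.isclose(w @ joint_feature_map(x, y, M), w_blocks[y] @ x)
```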


Non-trivial Example: Learning to Parse

For strongly structured problems, fixing point-wise margins to 1 (i.e., requiring $M_i \geq 1$) may be too rigid.

Solution: introduce the loss function $\Delta(\mathbf{y}, \mathbf{y}_i)$ and change the point-wise margins to:

$$M_i \geq 1 - \frac{\xi_i}{\Delta(\mathbf{y}, \mathbf{y}_i)}$$


Algorithmics: The Devil Lies in Detail

How can we solve optimization problems in the form

$$\min_{\mathbf{w}} \ \frac{1}{2} \|\mathbf{w}\|^2$$

$$\text{subject to: } \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}_i) \rangle - \max_{\mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i} \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}) \rangle \geq 1, \quad i = 1, \dots, N \ ?$$

Algorithmic challenges:

Constraints are rather complex (convex but non-differentiable).

The previous solution (replacing each constraint by |Y| constraints) does not work for infinite Y.


Sketch of the Algorithm

Main idea: de-couple optimization from finding the max in each of the N constraints.

Algorithm 1 Structured Output SVM

  input: {(x_1, y_1), ..., (x_N, y_N)}, Ψ, ε
  S ← ∅
  repeat
    for i = 1, ..., N do
      ŷ ← argmax_{y ∈ Y \ y_i} ⟨w, Ψ(x_i, y)⟩
      S ← S ∪ {(x_i, ŷ)}
    end for
    (w, ξ) ← solution to the QP with only the constraints from S
  until S does not change anymore
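A hedged Python transcription of Algorithm 1: the argmax oracle and the restricted QP solver are problem-specific and left as caller-supplied functions, and the tolerance ε from the input line is omitted from this sketch (in Tsochantaridis et al. it controls when a violated constraint is actually added).

```python
def structured_svm(examples, argmax_oracle, solve_qp):
    """Working-set loop of Algorithm 1 (a sketch; the two callables are assumed).

    examples      : list of (x_i, y_i) pairs
    argmax_oracle : (w, x, y_true) -> highest-scoring output y != y_true under w;
                    the returned output must be hashable (e.g., a tuple of labels)
    solve_qp      : working_set -> (w, xi), the QP solved with only the margin
                    constraints indexed by the working set
    """
    working_set = set()                        # S <- empty set
    w, xi = solve_qp(working_set)              # start from the unconstrained solution
    while True:
        old_size = len(working_set)
        for i, (x, y) in enumerate(examples):
            y_hat = argmax_oracle(w, x, y)     # argmax over Y \ {y_i}
            working_set.add((i, y_hat))        # S <- S u {(x_i, y_hat)}
        if len(working_set) == old_size:       # S did not change anymore
            return w, xi
        w, xi = solve_qp(working_set)          # re-solve QP with constraints from S
```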


Solving the argmax Problems

Recall that the argmax "comes from the classification stage" and corresponds to finding the most likely output, given a certain model, for a given input.

Problem-specific solutions:

Multi-class classification: trivial, O(M).

Label sequence learning: HMM prediction problem, Viterbi algorithm (dynamic programming), O(|y|²); a sketch follows below.

Sequence alignment: Smith-Waterman algorithm, O(|x|²).

Learning to parse: CKY parser, O(|x|³ · |G|), where |G| is the size of the grammar.
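For the label sequence case, a minimal Viterbi sketch for the argmax over label sequences. The decomposition into per-position scores and transition scores is an assumed parameterization of ⟨w, Ψ(x, y)⟩, not something specified on the slide.

```python
import numpy as np

def viterbi_argmax(node_scores, trans_scores):
    """argmax over label sequences of  sum_t node[t, y_t] + sum_t trans[y_{t-1}, y_t].

    node_scores : (T, K) array of per-position label scores
    trans_scores: (K, K) array of transition scores
    Returns the highest-scoring label sequence by dynamic programming.
    """
    T, K = node_scores.shape
    delta = np.zeros((T, K))              # best score of any prefix ending in label k
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = node_scores[0]
    for t in range(1, T):
        candidate = delta[t - 1][:, None] + trans_scores   # (K_prev, K_curr)
        backptr[t] = candidate.argmax(axis=0)
        delta[t] = candidate.max(axis=0) + node_scores[t]
    # Trace back the best path.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

print(viterbi_argmax(np.array([[1., 0.], [0., 1.], [1., 0.]]),
                     np.array([[0.5, 0.], [0., 0.5]])))  # -> [0, 0, 0]
```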


Complexity Analysis

Theorem 1 (Tsochantaridis et al.)

The Structured Output SVM reaches precision ε after at most

$$\max\left\{ \frac{2N\Delta}{\epsilon},\ \frac{8C\Delta^3 R^2}{\epsilon^2} \right\}$$

iterations, where

$$\Delta = \max_i \max_{\mathbf{y}} \Delta(\mathbf{y}, \mathbf{y}_i), \qquad R = \max_i \max_{\mathbf{y}} \|\Psi(\mathbf{x}_i, \mathbf{y}_i) - \Psi(\mathbf{x}_i, \mathbf{y})\|$$


Complexity Analysis

Proof sketch. The main idea is to lower-bound the progress in the objective function by optimizing along the direction corresponding to the added example. It follows that, by re-optimizing the whole working set, at least as much can be reached.

The increase of the objective along some direction η can be bounded as:

$$\max_{0 < \beta \leq D} \{\Theta(\alpha_0 + \beta\eta)\} - \Theta(\alpha_0) \geq \frac{1}{2} \min\left\{ D,\ \frac{\langle \nabla\Theta(\alpha_0), \eta \rangle}{\eta^T J \eta} \right\} \langle \nabla\Theta(\alpha_0), \eta \rangle$$

By fixing the direction η to be $e_r$ (the unit vector of the added example r), we have:

$$\max_{0 < \beta \leq D} \{\Theta(\alpha_0 + \beta e_r)\} - \Theta(\alpha_0) \geq \frac{1}{2} \min\left\{ D,\ \frac{\frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)}{J_{rr}} \right\} \frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)$$

The result follows from substituting specific values for $J_{rr}$ and $\frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)$ and inverting the bound.


Performance Measures in Machine Learning

Confusion matrix:

                      Assigned labels
                        +1      −1
  True labels    +1     TP      FN
                 −1     FP      TN

Classification error:

$$E = \frac{FP + FN}{TP + FP + TN + FN}$$

Neyman-Pearson errors:

$$FNR = \frac{FN}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

Precision/Recall:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
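A small sketch computing the measures above from predictions in {−1, +1}; the toy labels are made up, and no care is taken of empty denominators.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Compute the measures above from the binary confusion matrix (labels in {+1, -1})."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    return {
        "error": (fp + fn) / (tp + fp + tn + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

y_true = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, 1, -1])
print(confusion_metrics(y_true, y_pred))
```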


Performance Measures in Machine Learning (ctd.)

Receiver Operating Characteristic (ROC) and Precision/Recall curves:

Area under Curve (AUC, prAUC): integral measures of ROC or PR curves

F1-measure:

$$F_1 = \frac{2PR}{P + R}$$

Precision at k : precision with exactly k positive predictions.


Optimizing Performance Measures

Traditional learning methods optimize the classification error.

Some performance measures are linear, and hence easy to optimize:

⇒ classification error
⇒ Neyman-Pearson errors

Others are nonlinear...

⇒ precision/recall
⇒ F1-measure

...or multivariate

⇒ ROC-based measures depend on the ranking of all data points


Optimizing Performance Measures with SO-SVM

Main idea: instead of learning a single-valued function f : X → Y, consider the problem of learning all labels on all data points:

$$f : \mathcal{X} = X^N \longrightarrow \mathcal{Y} = \{-1, +1\}^N$$

Define the mapping

$$\Psi(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{N} y_i \mathbf{x}_i$$

Learn the linear discriminative function of structured output SVM:

$$f_{\mathbf{w}}(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} \langle \mathbf{w}, \Psi(\mathbf{x}, \mathbf{y}) \rangle$$
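A sketch of this multivariate mapping: since ⟨w, Ψ(x, y)⟩ = Σ_i y_i ⟨w, x_i⟩, the (loss-free) argmax decomposes into a per-example sign, which the brute-force check below confirms on made-up data.

```python
import numpy as np
from itertools import product

def joint_feature_map(X, y):
    """Psi(x_bar, y_bar) = sum_i y_i * x_i for labels y_i in {-1, +1}."""
    return X.T @ y

def predict_all(w, X):
    """argmax over y_bar of <w, Psi(X, y_bar)> decomposes: pick y_i = sign(<w, x_i>)."""
    scores = X @ w
    return np.where(scores >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
y_hat = predict_all(w, X)

# The decomposed prediction indeed maximizes the joint score over all 2^N labelings.
best = max(product([-1, 1], repeat=5),
           key=lambda y: w @ joint_feature_map(X, np.array(y)))
assert np.array_equal(np.array(best), y_hat)
```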


Algorithmic Implementation

All we need to care about is the argmax problem!

For a problem with N training points there are at most O(N²) different confusion matrices.

⇒ For performance measures based on a confusion matrix, the argmax can be computed explicitly in O(N²) time (a sketch follows below).

The number of swapped labels can be computed in O(N log N) by sorting the continuous classification scores:

⇒ For performance measures based on a ROC curve, the argmax can be computed in O(N log N) time.
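Below is a sketch of that O(N²) enumeration, in the spirit of Joachims [2]. It assumes linear per-example scores s_i = ⟨w, x_i⟩ and, as an example measure, the loss ∆ = 1 − F1 added to the joint score (margin rescaling); the lecture does not fix these details, so treat this as one possible instantiation rather than the method as presented.

```python
import numpy as np

def f1_loss(tp, fp, fn):
    """Delta(y', y) = 1 - F1 of the candidate labeling (one choice of measure)."""
    if tp == 0:
        return 1.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 1.0 - 2 * precision * recall / (precision + recall)

def most_violated_labeling(scores, y, loss=f1_loss):
    """O(N^2) loss-augmented argmax for confusion-matrix-based measures.

    scores : s_i = <w, x_i> for every training point
    y      : true labels in {-1, +1}
    Enumerates all (#positives labeled +1, #negatives labeled +1) pairs; within a
    pair the inner-product term is maximized by flipping the highest-scoring points.
    """
    pos = np.argsort(-scores[y == 1])        # positive examples, by decreasing score
    neg = np.argsort(-scores[y == -1])       # negative examples, by decreasing score
    s_pos, s_neg = scores[y == 1][pos], scores[y == -1][neg]
    n_pos, n_neg = len(s_pos), len(s_neg)
    # prefix sums so each candidate pair is evaluated in O(1)
    cum_pos = np.concatenate([[0.0], np.cumsum(s_pos)])
    cum_neg = np.concatenate([[0.0], np.cumsum(s_neg)])

    best, best_val = None, -np.inf
    for a in range(n_pos + 1):                # a positives predicted +1
        for b in range(n_neg + 1):            # b negatives predicted +1
            inner = (2 * cum_pos[a] - cum_pos[-1]) + (2 * cum_neg[b] - cum_neg[-1])
            value = loss(tp=a, fp=b, fn=n_pos - a) + inner
            if value > best_val:
                best, best_val = (a, b), value

    a, b = best
    rank_pos = np.empty(n_pos, dtype=int); rank_pos[pos] = np.arange(n_pos)
    rank_neg = np.empty(n_neg, dtype=int); rank_neg[neg] = np.arange(n_neg)
    y_hat = np.empty_like(y)
    y_hat[y == 1] = np.where(rank_pos < a, 1, -1)
    y_hat[y == -1] = np.where(rank_neg < b, 1, -1)
    return y_hat
```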


Summary

The framework of structured output learning allows one to extend the ideas of kernel methods to a large number of applications.

The algorithmics of structured output learning are based on alternating between optimization and applying the classification function (the argmax problem).

Using structured output learning, one can directly optimize various performance measures for binary classification problems.


Bibliography I

[1] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

[2] T. Joachims. A support vector method for multivariate performance measures. In International Conference on Machine Learning (ICML), pages 377–384, 2005.

[3] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[4] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In European Symposium on Artificial Neural Networks (ESANN), pages 219–224, 1999.
