
Lecture 10: Kernel Methods for Structured Outputs

Pavel Laskov and Blaine Nelson

Cognitive Systems Group, Wilhelm Schickard Institute for Computer Science, Universität Tübingen, Germany

Advanced Topics in Machine Learning, July 10, 2012


What Are Structured Outputs?

Multiclass classification

Parsing


Syntactic alignment


Label sequence learning


Multi-Class Classification: One-vs-the-Rest

Advantages:

Small number of classifiers

Rejection possible

Disadvantages:

Complex classifier boundaries

Unbalanced classification problems
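To make the scheme concrete, here is a minimal Python sketch of one-vs-the-rest training and prediction. The choice of scikit-learn's LinearSVC as the base binary learner and the rejection threshold are illustrative assumptions, not something prescribed by the lecture.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    """Train one binary SVM per class: class m vs. all other classes."""
    return {m: LinearSVC().fit(X, (y == m).astype(int)) for m in classes}

def predict_one_vs_rest(models, X, threshold=0.0):
    """Pick the class whose classifier is most confident."""
    scores = np.column_stack([models[m].decision_function(X) for m in sorted(models)])
    labels = np.array(sorted(models))[scores.argmax(axis=1)]
    # Rejection: if no classifier fires above the threshold, mark as -1 ("don't know").
    labels[scores.max(axis=1) < threshold] = -1
    return labels
```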


Pairwise Multi-Class Classification

Disadvantages:

Large number of classifiers

Rejection not possible

Advantages:

Simple classifier boundaries

Balanced classification problems
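A corresponding sketch for the pairwise scheme, again assuming LinearSVC as the base learner: one classifier per pair of classes, prediction by majority vote (note there is no natural rejection option).

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def train_pairwise(X, y, classes):
    """Train one binary SVM for every pair of classes: M(M-1)/2 classifiers in total."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
    return models

def predict_pairwise(models, X, classes):
    """Each pairwise classifier votes for one of its two classes; the majority wins."""
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), clf in models.items():
        pred = clf.predict(X)              # 1 -> vote for class a, 0 -> vote for class b
        votes[pred == 1, index[a]] += 1
        votes[pred == 0, index[b]] += 1
    return np.array(classes)[votes.argmax(axis=1)]
```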


Multi-Class Classification: Distributed Encoding

Encode classes with binary words with sufficient separation between them.

Train binary classifiers corresponding to the columns of the code matrix.

Apply all classifiers at the decision stage; decide by majority vote.

If the minimal Hamming distance between code words is d, then ⌊(d−1)/2⌋ errors can be corrected. (A small decoding sketch follows after the lists below.)

            code word
  class   1  2  3  4  5
    1     0  0  0  1  1
    2     0  0  1  0  1
    3     0  1  0  0  1
    4     0  1  1  0  0
    5     1  0  0  1  0
    6     1  1  0  1  1
    7     1  1  1  0  1
    8     1  1  1  1  0

Advantages:

Small number of classifiers

Balanced classification problems

Disadvantages:

Complex classifier boundaries

Rejection not possible
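Here is a minimal sketch of Hamming-distance decoding with the code matrix from the table above; the example inputs are made up.

```python
import numpy as np

# Code matrix from the slide: 8 classes encoded with 5 binary classifiers.
CODES = np.array([
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])  # row m-1 is the code word of class m

def decode(bits):
    """Assign the class whose code word has the smallest Hamming distance
    to the vector of binary classifier outputs."""
    distances = (CODES != np.asarray(bits)).sum(axis=1)
    return int(distances.argmin()) + 1   # classes are numbered 1..8

print(decode([1, 1, 0, 1, 1]))  # -> 6 (exact match with the code word of class 6)
print(decode([1, 0, 0, 0, 0]))  # -> 5 (code word 10010 is the closest one)
```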


Multi-Class SVM

Consider the decision function of a one-vs-the-rest classifier:

$$f(\mathbf{x}) = \operatorname*{argmax}_{m \in \{1,\dots,M\}} f_m(\mathbf{x})$$

Key idea: build this constraint into the training problem for each example with known label:

$$\min_{\mathbf{w}} \ \frac{1}{2} \sum_{m=1}^{M} (\mathbf{w}_m \cdot \mathbf{w}_m)$$

$$\text{subject to: } (\mathbf{w}_{y_i} \cdot \mathbf{x}_i) + b_{y_i} \geq \max_{m \neq y_i} (\mathbf{w}_m \cdot \mathbf{x}_i) + b_m, \quad \forall i$$


Multi-Class SVM (ctd.)

The training problem can be adjusted to "soft-margin" and transformed into the following form:

$$\min_{\mathbf{w}, \xi} \ \frac{1}{2} \sum_{m=1}^{M} (\mathbf{w}_m \cdot \mathbf{w}_m) + C \sum_{i=1}^{N} \sum_{m \neq y_i} \xi_{im}$$

$$\text{subject to: } (\mathbf{w}_{y_i} \cdot \mathbf{x}_i) + b_{y_i} \geq (\mathbf{w}_m \cdot \mathbf{x}_i) + b_m + 2 - \xi_{im}, \quad \forall i, m \neq y_i$$

$$\xi_{im} \geq 0, \quad \forall i, m \neq y_i$$

Single quadratic problem with 2N(M − 1) linear constraints.
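As an illustration, the soft-margin primal above can be written down directly with a generic convex/QP solver. This sketch uses the cvxpy modelling library (my choice, not the lecture's) and a made-up toy data set; it is not meant to scale, only to show the structure of the single problem.

```python
import numpy as np
import cvxpy as cp

def train_multiclass_svm(X, y, M, C=1.0):
    """Solve the soft-margin multi-class SVM primal from the slide with a generic QP solver."""
    N, d = X.shape
    W = cp.Variable((M, d))
    b = cp.Variable(M)
    xi = cp.Variable((N, M), nonneg=True)

    constraints = []
    for i in range(N):
        for m in range(M):
            if m == y[i]:
                continue
            # The true class must beat class m by a margin of 2, up to slack.
            constraints.append(
                W[y[i]] @ X[i] + b[y[i]] >= W[m] @ X[i] + b[m] + 2 - xi[i, m]
            )

    objective = cp.Minimize(0.5 * cp.sum_squares(W) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return W.value, b.value

# Toy usage: 3 well-separated classes in 2 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 10)
W, b = train_multiclass_svm(X, y, M=3)
print((X @ W.T + b).argmax(axis=1))  # predicted classes
```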


Multi-Class SVM (ctd.)

A dual formulation of the multi-class SVM can be obtained by the standard technique of Lagrange multipliers:

$$\max_{\alpha} \ 2 \sum_{i,m} \alpha_{im} - \sum_{i,j,m} \left[ \frac{1}{2} \mathbb{1}_{y_i = y_j} A_i A_j - \alpha_{im} \alpha_{j y_i} + \frac{1}{2} \alpha_{im} \alpha_{jm} \right] (\mathbf{x}_i \cdot \mathbf{x}_j)$$

$$\text{subject to: } \sum_{i=1}^{N} \alpha_{im} = \sum_{i=1}^{N} \mathbb{1}_{y_i = m} A_i, \quad m = 1, \dots, M$$

$$0 \leq \alpha_{im} \leq C, \quad \alpha_{i y_i} = 0$$

where $A_i = \sum_{m=1}^{M} \alpha_{im}$.


Multi-Class SVM: Summary

Generalized notion of the margin: the difference between the classification score of the correct class and the nearest competing classification score.

A single optimization problem accounts for interaction between the various classes.

Kernelization is straightforward.

The size of the optimization problem is multiplied by the number of classes.


Structured Output Learning: Preliminaries

Output values y belong to a discrete space Y (e.g., parse trees).

Goal: given the input {(x_1, y_1), ..., (x_N, y_N)}, find a function f : X → Y which represents the dependence of y on x.

How can we define a function with the range in Y?

Detour: Suppose we could learn some other function F : X × Y → R. Then we can define the learned prediction function as:

$$f = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} F(\mathbf{x}, \mathbf{y})$$
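A minimal sketch of this detour for a toy finite output space; the compatibility function F and the candidate outputs below are made-up placeholders. For genuinely structured spaces the enumeration is replaced by problem-specific search, as discussed later in the lecture.

```python
def predict(F, x, output_space):
    """f(x) = argmax over y in Y of F(x, y), assuming Y is small enough to enumerate."""
    return max(output_space, key=lambda y: F(x, y))

def F(x, y):
    """Toy compatibility score between an input word and a candidate tag."""
    if y == "VERB":
        return 1.0 if x.endswith("ing") else 0.0
    return 0.0 if x.endswith("ing") else 1.0

print(predict(F, "running", ["NOUN", "VERB"]))  # -> "VERB"
```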


Structured Output Learning: Problem Setup

Assume F to be linear in some joint space of inputs and outputs:

F (x, y) = 〈w,Ψ(x, y)〉

For each example, define its margin as the prediction difference to some "runner-up" example:

$$M_i = \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}_i) \rangle - \max_{\mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i} \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}) \rangle$$

Define the learning problem as weight minimization subject to margin constraints:

$$\min_{\mathbf{w}} \ \frac{1}{2} \|\mathbf{w}\|^2 \quad \text{subject to: } M_i \geq 1, \ \forall i$$


Re-inventing the Wheel: Multi-Class SVM

The single-problem multi-class SVM is a special case of structured output learning with the following definitions:

$$\mathbf{w} = [\mathbf{w}_1, \dots, \mathbf{w}_M]$$

$$\Psi(\mathbf{x}, y) = [\mathbb{1}_{y=1}\,\mathbf{x}, \dots, \mathbb{1}_{y=M}\,\mathbf{x}]$$

$$\|\mathbf{w}\|^2 = \sum_{m=1}^{M} (\mathbf{w}_m \cdot \mathbf{w}_m)$$

$$M_i = \langle \mathbf{w}_{y_i}, \mathbf{x}_i \rangle - \max_{m \neq y_i} \langle \mathbf{w}_m, \mathbf{x}_i \rangle$$
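A small numerical check of this correspondence, assuming the block layout of Ψ(x, y) given above: placing x in the block of class y makes the joint score ⟨w, Ψ(x, y)⟩ reduce to the per-class score ⟨w_y, x⟩.

```python
import numpy as np

def joint_feature_map(x, y, M):
    """Psi(x, y): place x in the block belonging to class y, zeros elsewhere."""
    psi = np.zeros(M * x.size)
    psi[y * x.size:(y + 1) * x.size] = x
    return psi

# With w = [w_1, ..., w_M] stacked, <w, Psi(x, y)> equals <w_y, x>.
M, d = 3, 4
rng = np.random.default_rng(1)
w_blocks = rng.normal(size=(M, d))   # w_1, ..., w_M
w = w_blocks.ravel()
x = rng.normal(size=d)

for y in range(M):
    assert np.isclose(w @ joint_feature_map(x, y, M), w_blocks[y] @ x)
```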


Non-trivial Example: Learning to Parse

For strongly structured problems, fixing point-wise margins to 1 (i.e., requiring $M_i \geq 1$) may be too rigid.

Solution: introduce the loss function $\Delta(\mathbf{y}, \mathbf{y}_i)$ and change the point-wise margins to:

$$M_i \geq 1 - \frac{\xi_i}{\Delta(\mathbf{y}, \mathbf{y}_i)}$$


Algorithmics: The Devil Lies in Detail

How can we solve optimization problems in the form

$$\min_{\mathbf{w}} \ \frac{1}{2} \|\mathbf{w}\|^2$$

$$\text{subject to: } \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}_i) \rangle - \max_{\mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i} \langle \mathbf{w}, \Psi(\mathbf{x}_i, \mathbf{y}) \rangle \geq 1, \quad i = 1, \dots, N \ ?$$

Algorithmic challenges:

Constraints are rather complex (convex but non-differentiable).

The previous solution (replacing each constraint by |Y| constraints) does not work for infinite Y.


Sketch of the Algorithm

Main idea: de-couple optimization from finding the max in each of the N constraints.

Algorithm 1 Structured Output SVM

  input: {(x_1, y_1), ..., (x_N, y_N)}, Ψ, ε
  S ← ∅
  repeat
    for i = 1, ..., N do
      ŷ ← argmax_{y ∈ Y \ y_i} ⟨w, Ψ(x_i, y)⟩
      S ← S ∪ {(x_i, ŷ)}
    end for
    (w, ξ) ← solution to the QP with only the constraints from S
  until S does not change anymore
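A hedged Python transcription of Algorithm 1: the argmax oracle and the restricted QP solver are problem-specific and left as caller-supplied functions, and the tolerance ε from the input line is omitted from this sketch (in Tsochantaridis et al. it controls when a violated constraint is actually added).

```python
def structured_svm(examples, argmax_oracle, solve_qp):
    """Working-set loop of Algorithm 1 (a sketch; the two callables are assumed).

    examples      : list of (x_i, y_i) pairs
    argmax_oracle : (w, x, y_true) -> highest-scoring output y != y_true under w;
                    the returned output must be hashable (e.g., a tuple of labels)
    solve_qp      : working_set -> (w, xi), the QP solved with only the margin
                    constraints indexed by the working set
    """
    working_set = set()                        # S <- empty set
    w, xi = solve_qp(working_set)              # start from the unconstrained solution
    while True:
        old_size = len(working_set)
        for i, (x, y) in enumerate(examples):
            y_hat = argmax_oracle(w, x, y)     # argmax over Y \ {y_i}
            working_set.add((i, y_hat))        # S <- S u {(x_i, y_hat)}
        if len(working_set) == old_size:       # S did not change anymore
            return w, xi
        w, xi = solve_qp(working_set)          # re-solve QP with constraints from S
```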


Solving the argmax Problems

Recall that the argmax "comes from the classification stage" and corresponds to finding the most likely output, given a certain model, for a given input.

Problem-specific solutions:

Multi-class classification: trivial, O(M).

Label sequence learning: HMM prediction problem, Viterbi algorithm (dynamic programming), O(|y|²); a sketch follows below.

Sequence alignment: Smith-Waterman algorithm, O(|x|²).

Learning to parse: CKY parser, O(|x|³ · |G|), where |G| is the size of the grammar.
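For the label sequence case, a minimal Viterbi sketch for the argmax over label sequences. The decomposition into per-position scores and transition scores is an assumed parameterization of ⟨w, Ψ(x, y)⟩, not something specified on the slide.

```python
import numpy as np

def viterbi_argmax(node_scores, trans_scores):
    """argmax over label sequences of  sum_t node[t, y_t] + sum_t trans[y_{t-1}, y_t].

    node_scores : (T, K) array of per-position label scores
    trans_scores: (K, K) array of transition scores
    Returns the highest-scoring label sequence by dynamic programming.
    """
    T, K = node_scores.shape
    delta = np.zeros((T, K))              # best score of any prefix ending in label k
    backptr = np.zeros((T, K), dtype=int)
    delta[0] = node_scores[0]
    for t in range(1, T):
        candidate = delta[t - 1][:, None] + trans_scores   # (K_prev, K_curr)
        backptr[t] = candidate.argmax(axis=0)
        delta[t] = candidate.max(axis=0) + node_scores[t]
    # Trace back the best path.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

print(viterbi_argmax(np.array([[1., 0.], [0., 1.], [1., 0.]]),
                     np.array([[0.5, 0.], [0., 0.5]])))  # -> [0, 0, 0]
```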


Complexity Analysis

Theorem 1 (Tsochantaridis et al.)

The Structured Output SVM reaches precision ε after at most

$$\max\left\{ \frac{2N\Delta}{\epsilon},\ \frac{8C\Delta^3 R^2}{\epsilon^2} \right\}$$

iterations, where

$$\Delta = \max_i \max_{\mathbf{y}} \Delta(\mathbf{y}, \mathbf{y}_i), \qquad R = \max_i \max_{\mathbf{y}} \|\Psi(\mathbf{x}_i, \mathbf{y}_i) - \Psi(\mathbf{x}_i, \mathbf{y})\|$$


Complexity Analysis

Proof sketch. The main idea is to lower-bound the progress in the objective function by optimizing along the direction corresponding to the added example. It follows that, by re-optimizing the whole working set, at least as much can be reached.

The increase of the objective along some direction η can be bounded as:

$$\max_{0 < \beta \leq D} \{\Theta(\alpha_0 + \beta\eta)\} - \Theta(\alpha_0) \geq \frac{1}{2} \min\left\{ D,\ \frac{\langle \nabla\Theta(\alpha_0), \eta \rangle}{\eta^T J \eta} \right\} \langle \nabla\Theta(\alpha_0), \eta \rangle$$

By fixing the direction η to be $e_r$ (the unit vector of the added example r), we have:

$$\max_{0 < \beta \leq D} \{\Theta(\alpha_0 + \beta e_r)\} - \Theta(\alpha_0) \geq \frac{1}{2} \min\left\{ D,\ \frac{\frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)}{J_{rr}} \right\} \frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)$$

The result follows from substituting specific values for $J_{rr}$ and $\frac{\partial\Theta}{\partial\alpha_r}(\alpha_0)$ and inverting the bound.


Performance Measures in Machine Learning

Confusion matrix:

                      Assigned labels
                        +1      −1
  True labels    +1     TP      FN
                 −1     FP      TN

Classification error:

$$E = \frac{FP + FN}{TP + FP + TN + FN}$$

Neyman-Pearson errors:

$$FNR = \frac{FN}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

Precision/Recall:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
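A small sketch computing the measures above from predictions in {−1, +1}; the toy labels are made up, and no care is taken of empty denominators.

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Compute the measures above from the binary confusion matrix (labels in {+1, -1})."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    return {
        "error": (fp + fn) / (tp + fp + tn + fn),
        "FNR": fn / (tp + fn),
        "FPR": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

y_true = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, 1, -1])
print(confusion_metrics(y_true, y_pred))
```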


Performance Measures in Machine Learning (ctd.)

Receiver Operating Characteristic (ROC) and Precision/Recall curves:

Area under Curve (AUC, prAUC): integral measures of ROC or PR curves

F1-measure:

$$F_1 = \frac{2PR}{P + R}$$

Precision at k : precision with exactly k positive predictions.


Optimizing Performance Measures

Traditional learning methods optimize the classification error.

Some performance measures are linear, and hence easy to optimize:

⇒ classification error
⇒ Neyman-Pearson errors

Others are nonlinear...

⇒ precision/recall
⇒ F1-measure

...or multivariate

⇒ ROC-based measures depend on the ranking of all data points


Optimizing Performance Measures with SO-SVM

Main idea: instead of learning a single-valued function f : X → Y, consider the problem of learning all labels on all data points:

$$f : \mathcal{X} = X^N \longrightarrow \mathcal{Y} = \{-1, +1\}^N$$

Define the mapping

$$\Psi(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{N} y_i \mathbf{x}_i$$

Learn the linear discriminative function of structured output SVM:

$$f_{\mathbf{w}}(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}} \langle \mathbf{w}, \Psi(\mathbf{x}, \mathbf{y}) \rangle$$
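A sketch of this multivariate mapping: since ⟨w, Ψ(x, y)⟩ = Σ_i y_i ⟨w, x_i⟩, the (loss-free) argmax decomposes into a per-example sign, which the brute-force check below confirms on made-up data.

```python
import numpy as np
from itertools import product

def joint_feature_map(X, y):
    """Psi(x_bar, y_bar) = sum_i y_i * x_i for labels y_i in {-1, +1}."""
    return X.T @ y

def predict_all(w, X):
    """argmax over y_bar of <w, Psi(X, y_bar)> decomposes: pick y_i = sign(<w, x_i>)."""
    scores = X @ w
    return np.where(scores >= 0, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
y_hat = predict_all(w, X)

# The decomposed prediction indeed maximizes the joint score over all 2^N labelings.
best = max(product([-1, 1], repeat=5),
           key=lambda y: w @ joint_feature_map(X, np.array(y)))
assert np.array_equal(np.array(best), y_hat)
```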


Algorithmic Implementation

All we need to care about is the argmax problem!

For a problem with N training points there are at most O(N²) different confusion matrices.

⇒ For performance measures based on a confusion matrix, the argmax can be computed explicitly in O(N²) time (a sketch follows below).

The number of swapped labels can be computed in O(N log N) by sorting the continuous classification scores:

⇒ For performance measures based on a ROC curve, the argmax can be computed in O(N log N) time.
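Below is a sketch of that O(N²) enumeration, in the spirit of Joachims [2]. It assumes linear per-example scores s_i = ⟨w, x_i⟩ and, as an example measure, the loss ∆ = 1 − F1 added to the joint score (margin rescaling); the lecture does not fix these details, so treat this as one possible instantiation rather than the method as presented.

```python
import numpy as np

def f1_loss(tp, fp, fn):
    """Delta(y', y) = 1 - F1 of the candidate labeling (one choice of measure)."""
    if tp == 0:
        return 1.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 1.0 - 2 * precision * recall / (precision + recall)

def most_violated_labeling(scores, y, loss=f1_loss):
    """O(N^2) loss-augmented argmax for confusion-matrix-based measures.

    scores : s_i = <w, x_i> for every training point
    y      : true labels in {-1, +1}
    Enumerates all (#positives labeled +1, #negatives labeled +1) pairs; within a
    pair the inner-product term is maximized by flipping the highest-scoring points.
    """
    pos = np.argsort(-scores[y == 1])        # positive examples, by decreasing score
    neg = np.argsort(-scores[y == -1])       # negative examples, by decreasing score
    s_pos, s_neg = scores[y == 1][pos], scores[y == -1][neg]
    n_pos, n_neg = len(s_pos), len(s_neg)
    # prefix sums so each candidate pair is evaluated in O(1)
    cum_pos = np.concatenate([[0.0], np.cumsum(s_pos)])
    cum_neg = np.concatenate([[0.0], np.cumsum(s_neg)])

    best, best_val = None, -np.inf
    for a in range(n_pos + 1):                # a positives predicted +1
        for b in range(n_neg + 1):            # b negatives predicted +1
            inner = (2 * cum_pos[a] - cum_pos[-1]) + (2 * cum_neg[b] - cum_neg[-1])
            value = loss(tp=a, fp=b, fn=n_pos - a) + inner
            if value > best_val:
                best, best_val = (a, b), value

    a, b = best
    rank_pos = np.empty(n_pos, dtype=int); rank_pos[pos] = np.arange(n_pos)
    rank_neg = np.empty(n_neg, dtype=int); rank_neg[neg] = np.arange(n_neg)
    y_hat = np.empty_like(y)
    y_hat[y == 1] = np.where(rank_pos < a, 1, -1)
    y_hat[y == -1] = np.where(rank_neg < b, 1, -1)
    return y_hat
```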


Summary

The framework of structured output learning allows one to extend the ideas of kernel methods to a large number of applications.

The algorithmics of structured output learning are based on alternating between optimization and applying the classification function (the argmax problem).

Using structured output learning, one can directly optimize various performance measures for binary classification problems.


Bibliography I

[1] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

[2] T. Joachims. A support vector method for multivariate performance measures. In International Conference on Machine Learning (ICML), pages 377–384, 2005.

[3] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[4] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In European Symposium on Artificial Neural Networks (ESANN), pages 219–224, 1999.
