TRANSCRIPT
Structured Prediction
(with Application in Information Retrieval)
Thomas Hofmann, Google, Switzerland
1 Structured Classification
Supervised Machine Learning
Learn functional dependencies between inputs and outputs from training data (inductive inference). Most basic form: classification.
• optical character recognition
• document categorization (example category labels: M13 = MONEY MARKETS, M132 = FOREX MARKETS, MCAT = MARKETS, UK = United Kingdom)
Automatic Document Categorization
[Figure: document categorization pipeline — an expert labels training examples (e.g. M13 = MONEY MARKETS, M132 = FOREX MARKETS, MCAT = MARKETS, UK = United Kingdom); a learning machine (/* some 'complicated' algorithm */) performs inductive inference to categorize new documents.]
Advantages: implicit knowledge (in action), no knowledge engineering, fully automatic & adaptive, human-level accuracy.
Structured Classification
Generalize classification methods to deal with structured outputs and/or with multiple, interdependent outputs.
Outputs are either
• structured objects such as sequences, strings, trees, etc.
• multiple response variables that are interdependent (e.g. dependencies modeled by probabilistic graphical models) = collective classification
Part-Of-Speech Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
N = noun, V = verb, P = preposition, ADV = adverb, ...
[Example taken from Michael Collins]
Challenge: Predict a tag sequence
• R. Garside, The CLAWS Word-tagging System. In: R. Garside, G. Leech and G. Sampson (eds), The Computational Analysis of English: A Corpus-based Approach. London: Longman, 1987.
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
Information Extraction
The outcome of a possible <organization> WTO </organization> cat fight is far from clear. Even if the <country> United States </country> won, nothing might change, said <person> Richard Aboulafia </person>, aviation analyst with the <organization> Teal Group </organization>, an industry consulting firm in <location> Fairfax, Va. </location>
Challenge: Predict a labeled segmentation
• J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
Natural Language Parsing
Syntactic analysis of natural language, e.g. based on a context-free grammar
Challenge: Predict a labeled parse tree
• M. Collins, Discriminative Reranking for Natural Language Parsing, ICML 2000.
• B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP, 2004.
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
Optical Character Recognition
Recognize handwritten words
[Figure: training set of handwritten word images]
Challenge: Predict a character sequence
• B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. NIPS, 2003.
Visual Scene Analysis
Recognize objects in a camera image or video
• A. Torralba, K. Murphy and W. Freeman, Contextual Models for Object Detection using Boosted Random Fields, NIPS 2004.
[Example taken from Torralba et al.]
Challenge: Predict a scene graph
Challenge: Recognize objects in context
Protein Structure Prediction
Predict protein secondary and tertiary structure from its primary structure
… AIVVTQHGRVTSVAAQHHWATAIAVLGRPK …
… BBBBBBB----AAAAAAAAAAAAAA ----- …
• Y. Liu, E. P. Xing, and J. Carbonell, Predicting protein folds with structural repeats using a chain graph model, ICML 2005
Challenge: Predict a labeled segmentation
Challenge: Binary contact matrix
2 Multiclass Classification
Linear Multiclass Classification
Linear Multiclass Classification - Geometry
Multiclass Perceptron
• Michael Collins, Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms, EMNLP 2002
Multiclass SVMs: Separation Margin
Multiclass SVM: Separation Margin, Illustrated
Multiclass SVMs: Formulation
• Koby Crammer, Yoram Singer, On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research, 2001.
From Multiclass to Structured Prediction
• Thorsten Joachims, Thomas Hofmann, Yisong Yue, Chun-Nam Yu, Predicting Structured Objects with Support Vector Machines, Communications of the ACM, 2009 (to appear).
From Multiclass to Structured Prediction
1. Compact representation: Define joint input-output feature functions and/or kernels that ensure generalization capabilities across inputs and outputs.
2. Efficient search: Develop problem- or representation-specific search algorithms, e.g. based on dynamic programming, min-flow, etc., to find the optimal output for a given input.
3. Refined error metric: Integrate an application-specific loss function into the problem formulation.
4. Training algorithm: Use efficient output search in conjunction with a (sparse) cutting-plane optimization method.
3 Function Classes for Structured Classification
Compatibility Functions
Linear compatibility functions in a joint feature representation of input/output pairs.
• The joint feature map encodes combined properties of inputs and outputs: which aspect of the input is relevant for which part of the output?
• Prediction: for a given input, predict the most compatible output (may involve search).
• Binary classification is recovered as a special case.
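The formulas themselves are not reproduced in this transcript; as a reconstruction of the standard setup (cf. Tsochantaridis et al., 2004), the linear compatibility function and prediction rule read

  F(x, y; w) = \langle w, \Phi(x, y) \rangle,
  \qquad \hat{y}(x) = \arg\max_{y \in \mathcal{Y}} F(x, y; w),

where \Phi(x, y) is the joint feature map. Binary classification is recovered with \mathcal{Y} = \{-1, +1\} and, e.g., \Phi(x, y) = \tfrac{y}{2}\,\Psi(x), so that \hat{y}(x) = \mathrm{sign}\,\langle w, \Psi(x) \rangle.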
Kernels for Discrete Inputs
Structured inputs are usually handled with appropriate feature extraction methods, i.e. for input space X, find a feature map.
With the use of kernels, the feature map can be implicit and is not restricted to finite dimension d.
Examples:
• String kernels [LSST+02], [LK04], [VS03]: number of common subsequences (with gaps)
• Tree kernels and convolution kernels [CD01]
• Graph kernels [GLF02], [Gär03]
• Set kernels [KJ03]
Research topics: modeling power, applications & experimental investigations, computational improvements
Joint Feature Map: Information Extraction
Joint input/output features, illustrated on the fragment "... IN REALITY TOUCH PANEL SYSTEMS CAPITALIZED AT ...":
• the word "systems" occurring in a company name segment
• the word "systems" occurring at the end of a company name
• the word "capitalized" following a company name segment
• a location segment following the word "in" (how often?)
• a location segment being followed by a company segment
• a company name segment consisting of 3 words
[Figure: COMPANY and LOC segmentations of example text with the count of each feature]
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
• S. Sarawagi & W. W. Cohen, Semi-Markov Conditional Random Fields for Information Extraction, NIPS 2004
Joint Feature Maps: Parsing
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
• B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP, 2004.
Tensor Product Feature Maps
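The slide's formula is not shown in the transcript; a minimal sketch, assuming the tensor-product construction \Phi(x, y) = \Lambda(y) \otimes \Psi(x) with a one-hot class encoding \Lambda (the function below is a hypothetical illustration, not from the source):

import numpy as np

def joint_feature(psi_x, lam_y):
    # Kronecker (tensor) product of the class encoding and the input features:
    # stacks one copy of the input feature vector per output dimension, with only
    # the block of the "active" class being non-zero for one-hot encodings.
    return np.kron(lam_y, psi_x)

psi_x = np.array([0.5, 1.0, 0.0, 2.0])   # input features Psi(x), d = 4
lam_y = np.eye(3)[1]                     # class attribute vector Lambda(y) for class y = 1 of 3
phi = joint_feature(psi_x, lam_y)        # joint feature vector of length 3 * 4 = 12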
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
Sequence Feature Maps
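The slide's formulas are not reproduced; a common sketch of an HMM-style sequence feature map (in the spirit of the hidden Markov SVM of Altun et al., 2003) stacks emission and transition statistics, e.g.

  \Phi(x, y) = \Big( \sum_{t} \Lambda(y_t) \otimes \Psi(x_t), \;\; \eta \sum_{t} \Lambda(y_{t-1}) \otimes \Lambda(y_t) \Big),

with \Lambda a one-hot label encoding, \Psi the per-position input features, and \eta a weighting constant.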
Reproducing Kernel Hilbert Spaces
• compatibility functions over inputs and outputs
• generating kernel
• RKHS
• expansion coefficients
• inner product / norm
RKHS and Finite-dimensional Vector Spaces
• Special case: kernels defined via inner products (finite-dimensional case)
• Tensor product feature maps
• Linear functions in some finite-dimensional feature representation as a special case
• Kernels serve two purposes: flexible and efficient feature extraction, and efficient combination of input/output features
4 Statistical Inference for Structured Prediction
Regularization & Representer Theorem
Combine the empirical risk over the training sample with an appropriate regularization functional (the Hilbert space norm), which acts as a stabilizer (penalizes the norm).
Representer Theorem
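In symbols (a standard form, not copied from the slide): the regularized objective is

  \min_{F \in \mathcal{H}} \; \sum_{i=1}^{n} \ell\big(F; (x_i, y_i)\big) \;+\; \frac{\lambda}{2}\, \|F\|_{\mathcal{H}}^2,

and the representer theorem states that a minimizer admits an expansion over the training inputs paired with candidate outputs,

  F(x, y) = \sum_{i=1}^{n} \sum_{\bar{y} \in \mathcal{Y}} \alpha_{i \bar{y}} \, k\big((x_i, \bar{y}), (x, y)\big).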
• B. Schölkopf, R. Herbrich, A. Smola, and R. C. Williamson, A Generalized Representer Theorem, Proceedings of the Annual Conference on Computational Learning Theory, pages 416-426, 2001.
• T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
Exponential Family Conditional Models
To derive learning/inference algorithms, a probabilistic interpretation of compatibility functions is fruitful.
Define a family of conditional distributions: higher compatibility means higher conditional probability. The key quantities are the sufficient statistics (given x), the canonical parameters, and the log-partition function.
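The defining formula is not reproduced in the transcript; the standard exponential-family form it refers to is

  p(y \mid x; w) = \exp\big( \langle w, \Phi(x, y) \rangle - g(w, x) \big),
  \qquad g(w, x) = \log \sum_{\bar{y} \in \mathcal{Y}} \exp\, \langle w, \Phi(x, \bar{y}) \rangle,

with \Phi(x, \cdot) the sufficient statistics (given x), w the canonical parameters, and g the log-partition function.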
• B. O. Koopman. On distributions admitting a sufficient statistic. Trans. of the American Math. Society, 39, 1936.
• M. K. Murray and J. W. Rice. Differential geometry and statistics. Chapman and Hall, 1993.
Risk Functions: Log-Loss
(Negative) conditional log-likelihood: maximize compatibility on training pairs, minimize the log-partition function.
[Figure: compatibilities f(x_i, y_i), f(x_i, y'), f(x_i, y''), f(x_i, y''') of the correct output and competing outputs]
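Assuming the exponential-family model above, the (negative) conditional log-likelihood of a training pair can be written as

  -\log p(y_i \mid x_i; w) = g(w, x_i) - \langle w, \Phi(x_i, y_i) \rangle,

so minimizing it maximizes the compatibility of the correct output while minimizing the log-partition function over all outputs.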
Conditional Random Field (CRF)
• J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001
• S. Kakade, Y. W. Teh, and S. Roweis. An alternative objective function for Markovian fields. ICML, 2002.
• Y. Altun, M. Johnson, and T. Hofmann, Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences, EMNLP, 2003.
Learning CRFs
Parametric case: gradient-based methods. The gradient involves an expectation of the joint features over outputs under the current conditional model, computed by exact inference (e.g. forward-backward, junction tree) or approximate inference.
Numerical optimization:
• Iterative scaling (not recommended)
• Preconditioned conjugate gradient
• Limited-memory quasi-Newton (BFGS)
• Bayesian CRFs (avoid overfitting)
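The gradient formula itself is not in the transcript; for the log-loss above it takes the familiar form

  \nabla_w \big( -\log p(y_i \mid x_i; w) \big) = \mathbb{E}_{y \sim p(\cdot \mid x_i; w)}\big[ \Phi(x_i, y) \big] - \Phi(x_i, y_i),

i.e. the feature expectation under the current conditional model minus the observed features.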
• J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001
• H. M. Wallach. Efficient training of conditional random fields. Master's thesis, University of Edinburgh, 2002.
• Y. Qi, M. Szummer, T. Minka, Bayesian Conditional Random Fields, AISTATS, 2005.
Regularization & Representer Theorem
Representer Theorem (again)
sum can be huge, if output space is combinatorial – sparseness is crucial
Risk Functions: Hinge Risk
Minimum log-odds: the difference between the compatibility of the correct output and that of the best incorrect output.
Hinge loss: differences of 1 and greater are fine; smaller differences incur a linear penalty.
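Written out (a standard form, not copied from the slide), the minimum log-odds margin and the resulting hinge loss are

  m_i(w) = \langle w, \Phi(x_i, y_i) \rangle - \max_{y \neq y_i} \langle w, \Phi(x_i, y) \rangle,
  \qquad \ell_{\mathrm{hinge}}(w; x_i, y_i) = \max\{0,\; 1 - m_i(w)\}.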
SVMstruct: Problem Formulation
Minimizing the hinge loss yields structured support vector machines, stated as a primal QP.
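As a reconstruction of the standard n-slack soft-margin formulation (following Tsochantaridis et al., 2004; not copied verbatim from the slide):

  \min_{w, \,\xi \geq 0} \;\; \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i
  \quad \text{s.t.} \quad \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \;\geq\; 1 - \xi_i
  \qquad \forall i, \; \forall y \in \mathcal{Y} \setminus \{y_i\}.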
Note: the number of constraints (per training example) equals the cardinality of the output space.
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
Soft Margin: Illustration
SVMstruct: Dual Formulation
Dual problem formulation (dual QP). Block structure: independent constraints for every training example.
SVMstruct: Algorithm
Main steps, iterated: constraint generation, adding the constraint to the working set, re-optimization.
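As an illustrative sketch of this constraint-generation loop (not the authors' implementation; find_most_violated and solve_qp below are hypothetical helpers standing in for the argmax black box and the QP solver):

def svm_struct_cutting_plane(examples, find_most_violated, solve_qp, eps, max_iter=100):
    # examples: list of (x_i, y_i); working_sets[i] holds the outputs whose constraints are active
    working_sets = [set() for _ in examples]
    w = None
    for _ in range(max_iter):
        added = False
        for i, (x, y) in enumerate(examples):
            y_hat, violation = find_most_violated(w, x, y)   # constraint generation
            if violation > eps:                              # only eps-violated constraints are added
                working_sets[i].add(y_hat)                   # adding constraint
                added = True
        if not added:                                        # no constraint violated by more than eps
            break
        w = solve_qp(examples, working_sets)                 # re-optimization over the working set
    return w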
SVMstruct: Variable Selection
• Incrementally add constraints (sequential strengthening of the QP)
• Corresponds to variable selection in the dual QP
• Sequentially optimize the dual over the selected variables
• Select maximally violating constraints (or at least ε-violating constraints) – requires a black-box search to find the best or second-best outputs
Theorem: after selecting a bounded number of ε-violating constraints, no remaining constraint is violated by more than ε.
• I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005.
• http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
SVMstruct: Proof of theorem (sketch)
• The dual objective is upper bounded by the primal, which is upper bounded by nC (zero weight vector).
• We show that each variable selection for an ε-violated constraint increases the dual objective by some minimum amount (bounded away from zero).
• This limits the number of sequentially selected ε-violated constraints.
SVMstruct: Support Outputs
• SVM: sparseness = select some examples
• SVMstruct: sparseness = select some outputs for some examples
• Example: named entity tagging (Spanish corpus)
[Figure: the sentence "PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA MERIDA ( EFE ) ." with its correct tag sequence O N N N M m m N N N O L N O N N, and the support vector sequences below it, each shown only where it differs from the correct sequence]
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
5 Loss Functions
Problem-Specific Loss Functions
Problem-specific prediction loss functions
This is important to model the quality of predictions in a domain/application-dependent way.
Example: Hamming loss
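For label sequences of length T, for instance, the Hamming loss counts the positions where prediction and truth disagree:

  \Delta(y, \bar{y}) = \sum_{t=1}^{T} [\![\, y_t \neq \bar{y}_t \,]\!].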
Margin Re-scaling
We require different margins for different outputs. This results in a modified hinge loss.
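In standard notation (a reconstruction, not copied from the slide), the margin-rescaled constraints and the resulting hinge loss read

  \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \;\geq\; \Delta(y_i, y) - \xi_i \quad \forall y,
  \qquad \ell_{\Delta}(w; x_i, y_i) = \max_{y} \big[ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle \big] - \langle w, \Phi(x_i, y_i) \rangle.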
• B. Taskar, C. Guestrin and D. Koller, Maximum Margin Markov Networks, NIPS 2003.
SVMstruct: Dual for Margin Re-scaling
Dual problem formulation (dual QP): only a minor modification in the bound constraints.
Black Box: Constraint Selection
Without margin re-scaling, we need a black box that finds the most violated constraint, i.e. the best incorrect output for the current model. With margin re-scaling, this becomes a loss-augmented search (written out below).
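The two search problems (whose formulas are not reproduced in the transcript) are, in standard form,

  \hat{y}_i = \arg\max_{y \neq y_i} \langle w, \Phi(x_i, y) \rangle
  \qquad \text{and, with margin re-scaling,} \qquad
  \hat{y}_i = \arg\max_{y} \big[ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle \big].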
Requirements:
• Search (argmax) needs to be feasible
• Second-best output may be required
• The loss function needs to decompose in a similar manner
Many tractable use cases: dynamic programming, min-cut, etc.
6 Compositionality and Graphical Models
Compositionality
Additive decomposition of the compatibility function over "parts": each part depends only on a subset of the variables, and the compatibility is additive over the parts. This induces a corresponding decomposability of the kernel functions (proof elementary).
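Written out (standard form, not copied from the slide), the additive decomposition is

  F(x, y; w) = \sum_{c \in \mathcal{C}} F_c(x_c, y_c; w),

where each part c depends only on a subset (x_c, y_c) of the variables; if each F_c lives in an RKHS with kernel k_c, the sum lives in an RKHS built from (a sum of) the part-wise kernels.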
• T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
Markov Networks
Conditional independence graph: in a Markov network, conditional independence between variables corresponds to graph separation.
• For distributions with full support: pairwise = local = global Markov property
• The distribution factorizes as a product over clique potential functions (Hammersley-Clifford)
Markov Network Kernels: Example
Label sequence learning with HMM structure
Overlapping input features
[Figure: HMM-structured Markov network over label variables y, y', u, u' and input variables x, x', illustrating overlapping input features]
Clique-based Sparse Approximation
Representer theorem over cliques:
• The standard representer theorem yields an expansion with one term per training example and candidate output – problematic for combinatorial output spaces.
• SVMstruct: sparseness due to hinge-loss properties
• Logistic loss: no sparseness, too many parameters
• For a decomposable RKHS: expansion over cliques, with far fewer variables (further reduction via parameter tying)
• B. Taskar, C. Guestrin and D. Koller, Maximum Margin Markov Networks, NIPS 2003.
• J. Lafferty, X. Zhu, and Y. Liu, Kernel conditional random fields: Representation and clique selection, ICML, 2004.
• Y. Altun, A. Smola, and T. Hofmann, Exponential Families for Conditional Random Fields, UAI 2004.
• T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
SVMstruct with Approximate Inference
• Undergenerating methods
  • Examples: greedy search, loopy belief propagation
  • Theoretical analysis: convergence, bounded (but not arbitrarily small) overall approximation quality, if the quality of the approximation can be bounded
• Overgenerating methods
  • Examples: LP relaxation (leading to solutions over {0, 1/2, 1}) [Boros & Hammer 2002] & quadratic pseudo-Boolean optimization using graph cuts [Kolmogorov & Rother 2004]
  • Using fractional values (1/2) in constraints gives an overly constrained problem, but a feasible solution will be found
• T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML 2008.
SVMstruct with Approximate Inference
• Evaluation on multi-label classification = multiple coupled/overlapping binary classification problems
• Dependencies modeled as a binary MRF
• Relaxation methods (overgenerating) work best for training
• Adding fractional constraints during training helps avoid fractional solutions at inference time
• T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML 2008.
7 Applications
7.1 Named Entity Classification
Application: Named Entity Classification
• Goal: finding names in sentences
• Label set: beginning and continuation of person, location, organization and miscellaneous names, plus non-name
• Data set: Spanish newswire corpus, CoNLL 2002
• Input features: word identity and spelling properties of the current, previous, and next observation
Example: The outcome of a possible <organization> WTO </organization> cat fight is far from clear. Even if the <country> United States </country> won, nothing might change, said <person> Richard Aboulafia </person>, aviation analyst with the <organization> Teal Group </organization>, an industry consulting firm in <location> Fairfax, Va. </location>
Results (error %): HMM 9.36, CRF 5.62, CRF-B 5.17, HM-PC 5.94, HM-SVM 5.08
7.2 Hierarchical Classification
Taxonomies: IPC
Sections: A: Human Necessities | B: Performing Operations; Transporting | C: Chemistry; Metallurgy | D: Textiles; Paper | E: Fixed Construction | F: Mechanical Engineering | G: Physics | H: Electricity
Classes under D (Textiles; Paper): D01: Natural or artificial threads or fibres; Spinning | D02: Yarns; Warping or Beaming; ... | D03: Weaving | D04: Braiding; Lace Making; Knitting; ... | D05: Sewing; Embroidering; Tufting | D06: Treatment of Textiles; ... | D07: Ropes; ... | D21: Paper; ...
Subclasses under D03 (Weaving): D03C: Shedding mechanisms; Pattern cards or chains; Punching of cards; Designing patterns | D03D: Woven fabrics; Methods of weaving; Looms | D03J: Auxiliary weaving apparatus; Weavers' tools; Shuttles
• World Intellectual Property Organization. International patent classification. URL, 2001. http://www.wipo.int/classifications/en/
Hierarchical Classification with Taxonomies
Class attributes from taxonomies
[Figure: taxonomy tree over categories, with leaves 1-5, inner nodes 6-8, and top-level nodes 9-10]
• Similarity function over categories (generalization across categories, not just across inputs)
• Special case: multiclass = orthogonal class representation
• T. Hofmann, L. Cai, and M. Ciaramita, Learning with Taxonomies: Classifying Documents and Words, Workshop on Syntax, Semantics, and Statistics, NIPS, 2003
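As a sketch of the class-attribute construction (assuming the encoding used in Cai & Hofmann, 2004): represent each category y by a vector \Lambda(y) with one component per taxonomy node, equal to 1 (or a node weight) for every node on the path from the root to y and 0 otherwise, and use the joint feature map \Phi(x, y) = \Lambda(y) \otimes \Psi(x). Categories sharing ancestors then share the weight-vector blocks of those ancestors, which gives generalization across related categories; an orthogonal (one-hot) \Lambda recovers plain multiclass classification.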
Hierarchical Classification
Model:
• Input feature map: vector space representation of documents
• Class attributes from the IPC taxonomy
• (Tree-specific loss function)
• Soft-margin objective and SVMstruct
Experiments:
• WIPO-alpha patent database (2002), based on the IPC taxonomy
• Trained on different sections, full and subsampled (3 training examples per leaf category)
[Chart: results in % for SVMflat vs. SVMstruct – 3samp: SVMflat 1.941, SVMstruct 1.822; full: SVMflat 1.323, SVMstruct 1.228]
• L. Cai and T. Hofmann, Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004.
7.3 Binary Classification with Unbalanced Classes
Unbalanced Classes in Text Categorization
• Binary text categorization: only very few positive examples (<1%)
• Classification error is a poor loss function (99%+ accuracy by trivial classification)
• More suitable error metrics:
  • F1 score (harmonic mean of precision and recall)
  • Precision-recall breakeven point
• Problem: these are functions of sets of predictions (not of individual predictions) -> structured classification
• Representation reduces to standard classification for 0/1 loss
• Constraint generation/prediction is efficient for metrics that depend on the contingency table of prediction errors
• T. Joachims. A support vector method for multivariate performance measures. ICML 2005.
Experiments
• Comparison on 4 data sets
• Plain binary SVM classification vs. SVMs trained on the specific loss
• T. Joachims. A support vector method for multivariate performance measures. ICML 2005.
7.4 Learning to Rank
Optimizing MAP with SVMstruct
• Loss function = 1 - MAP (mean average precision)
• Inputs q = search query
• Outputs y = rankings of documents/URLs (matrix of pairwise orderings)
• Training data: set of relevant and irrelevant documents per query
• Efficient constraint generation: exploit the fact that MAP only depends on the label sequence
• Yisong Yue, T. Finley, F. Radlinski, T. Joachims, A Support Vector Method for Optimizing Average Precision. SIGIR 2007.
[Chart: comparison with Indri retrieval functions (Lemur project)]
7.5 Optimizing Search Diversity
Search Result Diversity
• Traditionally:
  • Assess relevance of a document on an ordinal scale (e.g. binary)
  • Quality of a ranking depends only on the sequence of relevance scores (e.g. MAP, discounted cumulative gain)
• Challenge:
  • Relevance of a document (at rank r) depends on the other documents (at ranks r' < r) = relative relevance
  • Novelty relative to the other search results matters (diversity is good!)
• Effectively one needs to look at the whole ranking and model inter-document dependencies = a collective classification problem
Diversity via Topic Coverage
• Manually annotate results with sub-topics (for a selected set of training queries)
• Prediction goal: find the top n (=10) documents (=y) that maximize topic coverage, i.e. minimize the weighted fraction of uncovered topics
• Budgeted maximum coverage problem -> (1-1/e) greedy approximation (see the sketch below)
• How to generalize to new queries? Learn a term-weighting function.
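A minimal sketch of the greedy (1-1/e) approximation for the budgeted maximum coverage step (the inputs are hypothetical: docs maps document ids to the sets of sub-topics they cover, topic_weight gives each sub-topic's weight, and n is the result-set budget):

def greedy_coverage(docs, topic_weight, n=10):
    # docs: dict doc_id -> set of sub-topics covered by that document
    selected, covered = [], set()
    for _ in range(n):
        candidates = docs.keys() - set(selected)
        if not candidates:
            break
        # pick the document that adds the largest weighted amount of uncovered topics
        best = max(candidates, key=lambda d: sum(topic_weight[t] for t in docs[d] - covered))
        selected.append(best)
        covered |= docs[best]
    return selected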
• T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML 2008.
• Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. ICML 2008
Experiments
• TREC 6-8 interactive track queries
• Fine-grained topic annotations
• Comparison with Okapi, an unweighted word model, and the Essential Pages algorithm
• Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. ICML 2008
Conclusions
• Generalization of SVMs to structured classification via compatibility functions
• SVMstruct: hinge-loss-induced sparseness, column-generation approach producing a sequence of relaxations
  • efficient and highly generic approach
  • simple and theoretically well understood
  • alternative: clique-based representer theorem
• Wide range of applications
  • structured prediction problems
  • collective classification
  • many use cases in information retrieval