TRANSCRIPT
Structured Prediction
(with Application in Information Retrieval)
Thomas Hofmann, Google, Switzerland
1 Structured Classification
Supervised Machine Learning
Learn functional dependencies between inputs and outputs from training data (inductive inference). Most basic form: classification.
• optical character recognition
• document categorization (example category labels: M13 = MONEY MARKETS, M132 = FOREX MARKETS, MCAT = MARKETS, UK = United Kingdom)
Automatic Document Categorization
[Figure: document categorization pipeline — an expert labels training examples (e.g. M13 = MONEY MARKETS, M132 = FOREX MARKETS, MCAT = MARKETS, UK = United Kingdom); a learning machine (/* some 'complicated' algorithm */) performs inductive inference to categorize new documents.]
Advantages: implicit knowledge (in action), no knowledge engineering, fully automatic & adaptive, human-level accuracy.
Structured Classification
Generalize classification methods to deal with structured outputs and/or with multiple, interdependent outputs.
Outputs are either
• structured objects such as sequences, strings, trees, etc.
• multiple response variables that are interdependent (e.g. dependencies modeled by probabilistic graphical models) = collective classification
Part-Of-Speech Tagging
INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.
OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
N = noun, V = verb, P = preposition, ADV = adverb, ...
[Example taken from Michael Collins]
Challenge: Predict a tag sequence
• R. Garside, The CLAWS Word-tagging System. In: R. Garside, G. Leech and G. Sampson (eds), The Computational Analysis of English: A Corpus-based Approach. London: Longman, 1987.
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
Information Extraction
The outcome of a possible <organization> WTO </organization> cat fight is far from clear. Even if the <country> United States </country> won, nothing might change, said <person> Richard Aboulafia </person>, aviation analyst with the <organization> Teal Group </organization>, an industry consulting firm in <location> Fairfax, Va. </location>
Challenge: Predict a labeled segmentation
• J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
Natural Language Parsing
Syntactic analysis of natural language, e.g. based on a context-free grammar
Challenge: Predict a labeled parse tree
• M. Collins, Discriminative Reranking for Natural Language Parsing, ICML 2000.
• B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP, 2004.
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
Optical Character Recognition
Recognize handwritten words
[Figure: training set of handwritten word images]
Challenge: Predict a character sequence
• B. Taskar, C. Guestrin, and D. Koller. Max margin Markov networks. NIPS, 2003.
Visual Scene Analysis
Recognize objects in a camera image or video
• A. Torralba, K. Murphy and W. Freeman, Contextual Models for Object Detection using Boosted Random Fields, NIPS 2004.
[Example taken from Torralba et al.]
Challenge: Predict a scene graph
Challenge: Recognize objects in context
Protein Structure Prediction
Predict protein secondary and tertiary structure from its primary structure
… AIVVTQHGRVTSVAAQHHWATAIAVLGRPK …
… BBBBBBB----AAAAAAAAAAAAAA ----- …
• Y. Liu, E. P. Xing, and J. Carbonell, Predicting protein folds with structural repeats using a chain graph model, ICML 2005
Challenge: Predict a labeled segmentation
Challenge: Binary contact matrix
2 Multiclass Classification
Linear Multiclass Classification
Linear Multiclass Classification - Geometry
Multiclass Perceptron
• Michael Collins, Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms, EMNLP 2002
Multiclass SVMs: Separation Margin
Multiclass SVM: Separation Margin, Illustrated
Multiclass SVMs: Formulation
• Koby Crammer, Yoram Singer, On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research, 2001.
From Multiclass to Structured Prediction
• Thorsten Joachims, Thomas Hofmann, Yisong Yue, Chun-Nam Yu, Predicting Structured Objects with Support Vector Machines, Communications of the ACM, 2009 (to appear).
From Multiclass to Structured Prediction
1. Compact representation: Define joint input-output feature functions and/or kernels that ensure generalization capabilities across inputs and outputs.
2. Efficient search: Develop problem- or representation-specific search algorithms, e.g. based on dynamic programming, min-flow, etc., to find the optimal output for a given input.
3. Refined error metric: Integrate an application-specific loss function into the problem formulation.
4. Training algorithm: Use efficient output search in conjunction with a (sparse) cutting-plane optimization method.
3 Function Classes for Structured Classification
Compatibility Functions
Linear compatibility functions in a joint feature representation of input/output pairs.
• The joint feature map encodes combined properties of inputs and outputs: which aspect of the input is relevant for which part of the output?
• Prediction: for a given input, predict the most compatible output (may involve search).
• Binary classification is recovered as a special case.
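The formulas themselves are not reproduced in this transcript; as a reconstruction of the standard setup (cf. Tsochantaridis et al., 2004), the linear compatibility function and prediction rule read

  F(x, y; w) = \langle w, \Phi(x, y) \rangle,
  \qquad \hat{y}(x) = \arg\max_{y \in \mathcal{Y}} F(x, y; w),

where \Phi(x, y) is the joint feature map. Binary classification is recovered with \mathcal{Y} = \{-1, +1\} and, e.g., \Phi(x, y) = \tfrac{y}{2}\,\Psi(x), so that \hat{y}(x) = \mathrm{sign}\,\langle w, \Psi(x) \rangle.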
Kernels for Discrete Inputs
Structured inputs are usually handled with appropriate feature extraction methods, i.e. for input space X, find a feature map.
With the use of kernels, the feature map can be implicit and is not restricted to finite dimension d.
Examples:
• String kernels [LSST+02], [LK04], [VS03]: number of common subsequences (with gaps)
• Tree kernels and convolution kernels [CD01]
• Graph kernels [GLF02], [Gär03]
• Set kernels [KJ03]
Research topics: modeling power, applications & experimental investigations, computational improvements
Joint Feature Map: Information Extraction
Joint input/output features, illustrated on the fragment "... IN REALITY TOUCH PANEL SYSTEMS CAPITALIZED AT ...":
• the word "systems" occurring in a company name segment
• the word "systems" occurring at the end of a company name
• the word "capitalized" following a company name segment
• a location segment following the word "in" (how often?)
• a location segment being followed by a company segment
• a company name segment consisting of 3 words
[Figure: COMPANY and LOC segmentations of example text with the count of each feature]
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
• S. Sarawagi & W. W. Cohen, Semi-Markov Conditional Random Fields for Information Extraction, NIPS 2004
Joint Feature Maps: Parsing
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
• B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning, Max-Margin Parsing, EMNLP, 2004.
Tensor Product Feature Maps
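The slide's formula is not shown in the transcript; a minimal sketch, assuming the tensor-product construction \Phi(x, y) = \Lambda(y) \otimes \Psi(x) with a one-hot class encoding \Lambda (the function below is a hypothetical illustration, not from the source):

import numpy as np

def joint_feature(psi_x, lam_y):
    # Kronecker (tensor) product of the class encoding and the input features:
    # stacks one copy of the input feature vector per output dimension, with only
    # the block of the "active" class being non-zero for one-hot encodings.
    return np.kron(lam_y, psi_x)

psi_x = np.array([0.5, 1.0, 0.0, 2.0])   # input features Psi(x), d = 4
lam_y = np.eye(3)[1]                     # class attribute vector Lambda(y) for class y = 1 of 3
phi = joint_feature(psi_x, lam_y)        # joint feature vector of length 3 * 4 = 12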
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
Sequence Feature Maps
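The slide's formulas are not reproduced; a common sketch of an HMM-style sequence feature map (in the spirit of the hidden Markov SVM of Altun et al., 2003) stacks emission and transition statistics, e.g.

  \Phi(x, y) = \Big( \sum_{t} \Lambda(y_t) \otimes \Psi(x_t), \;\; \eta \sum_{t} \Lambda(y_{t-1}) \otimes \Lambda(y_t) \Big),

with \Lambda a one-hot label encoding, \Psi the per-position input features, and \eta a weighting constant.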
Reproducing Kernel Hilbert Spaces
• compatibility functions over inputs and outputs
• generating kernel
• RKHS
• expansion coefficients
• inner product / norm
RKHS and Finite-dimensional Vector Spaces
• Special case: kernels defined via inner products (finite-dimensional case)
• Tensor product feature maps
• Linear functions in some finite-dimensional feature representation as a special case
• Kernels serve two purposes: flexible and efficient feature extraction, and efficient combination of input/output features
4 Statistical Inference for Structured Prediction
Regularization & Representer Theorem
Combine the empirical risk over the training sample with an appropriate regularization functional (the Hilbert space norm), which acts as a stabilizer (penalizes the norm).
Representer Theorem
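In symbols (a standard form, not copied from the slide): the regularized objective is

  \min_{F \in \mathcal{H}} \; \sum_{i=1}^{n} \ell\big(F; (x_i, y_i)\big) \;+\; \frac{\lambda}{2}\, \|F\|_{\mathcal{H}}^2,

and the representer theorem states that a minimizer admits an expansion over the training inputs paired with candidate outputs,

  F(x, y) = \sum_{i=1}^{n} \sum_{\bar{y} \in \mathcal{Y}} \alpha_{i \bar{y}} \, k\big((x_i, \bar{y}), (x, y)\big).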
• B. Schölkopf, R. Herbrich, A. Smola, and R. C. Williamson, A Generalized Representer Theorem, Proceedings of the Annual Conference on Computational Learning Theory, pages 416-426, 2001.
• T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
Exponential Family Conditional Models
To derive learning/inference algorithms, a probabilistic interpretation of compatibility functions is fruitful.
Define a family of conditional distributions: higher compatibility means higher conditional probability. The key quantities are the sufficient statistics (given x), the canonical parameters, and the log-partition function.
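The defining formula is not reproduced in the transcript; the standard exponential-family form it refers to is

  p(y \mid x; w) = \exp\big( \langle w, \Phi(x, y) \rangle - g(w, x) \big),
  \qquad g(w, x) = \log \sum_{\bar{y} \in \mathcal{Y}} \exp\, \langle w, \Phi(x, \bar{y}) \rangle,

with \Phi(x, \cdot) the sufficient statistics (given x), w the canonical parameters, and g the log-partition function.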
• B. O. Koopman. On distributions admitting a sufficient statistic. Trans. of the American Math. Society, 39, 1936.
• M. K. Murray and J. W. Rice. Differential geometry and statistics. Chapman and Hall, 1993.
Risk Functions: Log-Loss
(Negative) conditional log-likelihood: maximize compatibility on training pairs, minimize the log-partition function.
[Figure: compatibilities f(x_i, y_i), f(x_i, y'), f(x_i, y''), f(x_i, y''') of the correct output and competing outputs]
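Assuming the exponential-family model above, the (negative) conditional log-likelihood of a training pair can be written as

  -\log p(y_i \mid x_i; w) = g(w, x_i) - \langle w, \Phi(x_i, y_i) \rangle,

so minimizing it maximizes the compatibility of the correct output while minimizing the log-partition function over all outputs.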
Conditional Random Field (CRF)
• J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001
• S. Kakade, Y. W. Teh, and S. Roweis. An alternative objective function for Markovian fields. ICML, 2002.
• Y. Altun, M. Johnson, and T. Hofmann, Investigating Loss Functions and Optimization Methods for Discriminative Learning of Label Sequences, EMNLP, 2003.
Learning CRFs
Parametric case: gradient-based methods. The gradient involves an expectation of the joint features over outputs under the current conditional model, computed by exact inference (e.g. forward-backward, junction tree) or approximate inference.
Numerical optimization:
• Iterative scaling (not recommended)
• Preconditioned conjugate gradient
• Limited-memory quasi-Newton (BFGS)
• Bayesian CRFs (avoid overfitting)
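The gradient formula itself is not in the transcript; for the log-loss above it takes the familiar form

  \nabla_w \big( -\log p(y_i \mid x_i; w) \big) = \mathbb{E}_{y \sim p(\cdot \mid x_i; w)}\big[ \Phi(x_i, y) \big] - \Phi(x_i, y_i),

i.e. the feature expectation under the current conditional model minus the observed features.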
• J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 2001
• H. M. Wallach. Efficient training of conditional random fields. Master's thesis, University of Edinburgh, 2002.
• Y. Qi, M. Szummer, T. Minka, Bayesian Conditional Random Fields, AISTATS, 2005.
Regularization & Representer Theorem
Representer Theorem (again)
sum can be huge, if output space is combinatorial – sparseness is crucial
Risk Functions: Hinge Risk
Minimum log-odds: the difference between the compatibility of the correct output and that of the best incorrect output.
Hinge loss: differences of 1 and greater are fine; smaller differences incur a linear penalty.
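Written out (a standard form, not copied from the slide), the minimum log-odds margin and the resulting hinge loss are

  m_i(w) = \langle w, \Phi(x_i, y_i) \rangle - \max_{y \neq y_i} \langle w, \Phi(x_i, y) \rangle,
  \qquad \ell_{\mathrm{hinge}}(w; x_i, y_i) = \max\{0,\; 1 - m_i(w)\}.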
SVMstruct: Problem Formulation
Minimizing the hinge loss yields structured support vector machines, stated as a primal QP.
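As a reconstruction of the standard n-slack soft-margin formulation (following Tsochantaridis et al., 2004; not copied verbatim from the slide):

  \min_{w, \,\xi \geq 0} \;\; \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i
  \quad \text{s.t.} \quad \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \;\geq\; 1 - \xi_i
  \qquad \forall i, \; \forall y \in \mathcal{Y} \setminus \{y_i\}.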
Note: the number of constraints (per training example) equals the cardinality of the output space.
• I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support Vector Machine Learning for Interdependent and Structured Output Spaces, ICML 2004.
Soft Margin: Illustration
SVMstruct: Dual Formulation
Dual problem formulation (dual QP). Block structure: independent constraints for every training example.
SVMstruct: Algorithm
Main steps, iterated: constraint generation, adding the constraint to the working set, re-optimization.
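As an illustrative sketch of this constraint-generation loop (not the authors' implementation; find_most_violated and solve_qp below are hypothetical helpers standing in for the argmax black box and the QP solver):

def svm_struct_cutting_plane(examples, find_most_violated, solve_qp, eps, max_iter=100):
    # examples: list of (x_i, y_i); working_sets[i] holds the outputs whose constraints are active
    working_sets = [set() for _ in examples]
    w = None
    for _ in range(max_iter):
        added = False
        for i, (x, y) in enumerate(examples):
            y_hat, violation = find_most_violated(w, x, y)   # constraint generation
            if violation > eps:                              # only eps-violated constraints are added
                working_sets[i].add(y_hat)                   # adding constraint
                added = True
        if not added:                                        # no constraint violated by more than eps
            break
        w = solve_qp(examples, working_sets)                 # re-optimization over the working set
    return w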
SVMstruct: Variable Selection
• Incrementally add constraints (sequential strengthening of the QP)
• Corresponds to variable selection in the dual QP
• Sequentially optimize the dual over the selected variables
• Select maximally violating constraints (or at least ε-violating constraints) – requires a black-box search to find the best or second-best outputs
Theorem: after selecting a bounded number of ε-violating constraints, no remaining constraint is violated by more than ε.
• I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, JMLR 2005.
• http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
SVMstruct: Proof of theorem (sketch)
• The dual objective is upper bounded by the primal, which is upper bounded by nC (zero weight vector).
• We show that each variable selection for an ε-violated constraint increases the dual objective by some minimum amount (bounded away from zero).
• This limits the number of sequentially selected ε-violated constraints.
SVMstruct: Support Outputs
• SVM: sparseness = select some examples
• SVMstruct: sparseness = select some outputs for some examples
• Example: named entity tagging (Spanish corpus)
[Figure: the sentence "PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA MERIDA ( EFE ) ." with its correct tag sequence O N N N M m m N N N O L N O N N, and the support vector sequences below it, each shown only where it differs from the correct sequence]
• Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden Markov Support Vector Machines, ICML, 2003
5 Loss Functions
Problem-Specific Loss Functions
Problem-specific prediction loss functions
This is important to model the quality of predictions in a domain/application-dependent way.
Example: Hamming loss
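For label sequences of length T, for instance, the Hamming loss counts the positions where prediction and truth disagree:

  \Delta(y, \bar{y}) = \sum_{t=1}^{T} [\![\, y_t \neq \bar{y}_t \,]\!].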
Margin Re-scaling
We require different margins for different outputs. This results in a modified hinge loss.
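In standard notation (a reconstruction, not copied from the slide), the margin-rescaled constraints and the resulting hinge loss read

  \langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \;\geq\; \Delta(y_i, y) - \xi_i \quad \forall y,
  \qquad \ell_{\Delta}(w; x_i, y_i) = \max_{y} \big[ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle \big] - \langle w, \Phi(x_i, y_i) \rangle.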
• B. Taskar, C. Guestrin and D. Koller, Maximum Margin Markov Networks, NIPS 2003.
SVMstruct: Dual for Margin Re-scaling
Dual problem formulation (dual QP): only a minor modification in the bound constraints.
Black Box: Constraint Selection
Without margin re-scaling, we need a black box that finds the most violated constraint, i.e. the best incorrect output for the current model. With margin re-scaling, this becomes a loss-augmented search (written out below).
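The two search problems (whose formulas are not reproduced in the transcript) are, in standard form,

  \hat{y}_i = \arg\max_{y \neq y_i} \langle w, \Phi(x_i, y) \rangle
  \qquad \text{and, with margin re-scaling,} \qquad
  \hat{y}_i = \arg\max_{y} \big[ \Delta(y_i, y) + \langle w, \Phi(x_i, y) \rangle \big].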
Requirements:
• Search (argmax) needs to be feasible
• Second-best output may be required
• The loss function needs to decompose in a similar manner
Many tractable use cases: dynamic programming, min-cut, etc.
6 Compositionality and Graphical Models
Compositionality
Additive decomposition of the compatibility function over "parts": each part depends only on a subset of the variables, and the compatibility is additive over the parts. This induces a corresponding decomposability of the kernel functions (proof elementary).
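Written out (standard form, not copied from the slide), the additive decomposition is

  F(x, y; w) = \sum_{c \in \mathcal{C}} F_c(x_c, y_c; w),

where each part c depends only on a subset (x_c, y_c) of the variables; if each F_c lives in an RKHS with kernel k_c, the sum lives in an RKHS built from (a sum of) the part-wise kernels.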
• T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
Markov Networks
Conditional independence graph: in a Markov network, conditional independence between variables corresponds to graph separation.
• For distributions with full support: pairwise = local = global Markov property
• The distribution factorizes as a product over clique potential functions (Hammersley-Clifford)
Markov Network Kernels: Example
Label sequence learning with HMM structure
Overlapping input features
[Figure: HMM-structured Markov network over label variables y, y', u, u' and input variables x, x', illustrating overlapping input features]
Clique-based Sparse Approximation
Representer theorem over cliques:
• The standard representer theorem yields an expansion with one term per training example and candidate output – problematic for combinatorial output spaces.
• SVMstruct: sparseness due to hinge-loss properties
• Logistic loss: no sparseness, too many parameters
• For a decomposable RKHS: expansion over cliques, with far fewer variables (further reduction via parameter tying)
• B. Taskar, C. Guestrin and D. Koller, Maximum Margin Markov Networks, NIPS 2003.
• J. Lafferty, X. Zhu, and Y. Liu, Kernel conditional random fields: Representation and clique selection, ICML, 2004.
• Y. Altun, A. Smola, and T. Hofmann, Exponential Families for Conditional Random Fields, UAI 2004.
• T. Hofmann, B. Schölkopf, and A. Smola, A Tutorial Review of RKHS Methods in Machine Learning, Annals of Statistics, 2008
SVMstruct with Approximate Inference
• Undergenerating methods
  • Examples: greedy search, loopy belief propagation
  • Theoretical analysis: convergence, bounded (but not arbitrarily small) overall approximation quality, if the quality of the approximation can be bounded
• Overgenerating methods
  • Examples: LP relaxation (leading to solutions over {0, 1/2, 1}) [Boros & Hammer 2002] & quadratic pseudo-Boolean optimization using graph cuts [Kolmogorov & Rother 2004]
  • Using fractional values (1/2) in constraints gives an overly constrained problem, but a feasible solution will be found
• T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML 2008.
SVMstruct with Approximate Inference
• Evaluation on multi-label classification = multiple coupled/overlapping binary classification problems
• Dependencies modeled as a binary MRF
• Relaxation methods (overgenerating) work best for training
• Adding fractional constraints during training helps avoid fractional solutions at inference time
• T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML 2008.
7 Applications
7.1 Named Entity Classification
Application: Named Entity Classification
• Goal: finding names in sentences
• Label set: beginning and continuation of person, location, organization and miscellaneous names, plus non-name
• Data set: Spanish newswire corpus, CoNLL 2002
• Input features: word identity and spelling properties of the current, previous, and next observation
Example: The outcome of a possible <organization> WTO </organization> cat fight is far from clear. Even if the <country> United States </country> won, nothing might change, said <person> Richard Aboulafia </person>, aviation analyst with the <organization> Teal Group </organization>, an industry consulting firm in <location> Fairfax, Va. </location>
Results (error %): HMM 9.36, CRF 5.62, CRF-B 5.17, HM-PC 5.94, HM-SVM 5.08
7.2 Hierarchical Classification
Taxonomies: IPC
Sections: A: Human Necessities | B: Performing Operations; Transporting | C: Chemistry; Metallurgy | D: Textiles; Paper | E: Fixed Construction | F: Mechanical Engineering | G: Physics | H: Electricity
Classes under D (Textiles; Paper): D01: Natural or artificial threads or fibres; Spinning | D02: Yarns; Warping or Beaming; ... | D03: Weaving | D04: Braiding; Lace Making; Knitting; ... | D05: Sewing; Embroidering; Tufting | D06: Treatment of Textiles; ... | D07: Ropes; ... | D21: Paper; ...
Subclasses under D03 (Weaving): D03C: Shedding mechanisms; Pattern cards or chains; Punching of cards; Designing patterns | D03D: Woven fabrics; Methods of weaving; Looms | D03J: Auxiliary weaving apparatus; Weavers' tools; Shuttles
• World Intellectual Property Organization. International patent classification. URL, 2001. http://www.wipo.int/classifications/en/
Hierarchical Classification with Taxonomies
Class attributes from taxonomies
[Figure: taxonomy tree over categories, with leaves 1-5, inner nodes 6-8, and top-level nodes 9-10]
• Similarity function over categories (generalization across categories, not just across inputs)
• Special case: multiclass = orthogonal class representation
• T. Hofmann, L. Cai, and M. Ciaramita, Learning with Taxonomies: Classifying Documents and Words, Workshop on Syntax, Semantics, and Statistics, NIPS, 2003
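As a sketch of the class-attribute construction (assuming the encoding used in Cai & Hofmann, 2004): represent each category y by a vector \Lambda(y) with one component per taxonomy node, equal to 1 (or a node weight) for every node on the path from the root to y and 0 otherwise, and use the joint feature map \Phi(x, y) = \Lambda(y) \otimes \Psi(x). Categories sharing ancestors then share the weight-vector blocks of those ancestors, which gives generalization across related categories; an orthogonal (one-hot) \Lambda recovers plain multiclass classification.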
Hierarchical Classification
Model:
• Input feature map: vector space representation of documents
• Class attributes from the IPC taxonomy
• (Tree-specific loss function)
• Soft-margin objective and SVMstruct
Experiments:
• WIPO-alpha patent database (2002), based on the IPC taxonomy
• Trained on different sections, full and subsampled (3 training examples per leaf category)
[Chart: results in % for SVMflat vs. SVMstruct – 3samp: SVMflat 1.941, SVMstruct 1.822; full: SVMflat 1.323, SVMstruct 1.228]
• L. Cai and T. Hofmann, Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004.
7.3 Binary Classification with Unbalanced Classes
Unbalanced Classes in Text Categorization
• Binary text categorization: only very few positive examples (<1%)
• Classification error is a poor loss function (99%+ accuracy by trivial classification)
• More suitable error metrics:
  • F1 score (harmonic mean of precision and recall)
  • Precision-recall breakeven point
• Problem: these are functions of sets of predictions (not of individual predictions) -> structured classification
• Representation reduces to standard classification for 0/1 loss
• Constraint generation/prediction is efficient for metrics that depend on the contingency table of prediction errors
• T. Joachims. A support vector method for multivariate performance measures. ICML 2005.
Experiments
• Comparison on 4 data sets
• Plain binary SVM classification vs. SVMs trained on the specific loss
• T. Joachims. A support vector method for multivariate performance measures. ICML 2005.
7.4 Learning to Rank
Optimizing MAP with SVMstruct
• Loss function = 1 - MAP (mean average precision)
• Inputs q = search query
• Outputs y = rankings of documents/URLs (matrix of pairwise orderings)
• Training data: set of relevant and irrelevant documents per query
• Efficient constraint generation: exploit the fact that MAP only depends on the label sequence
• Yisong Yue, T. Finley, F. Radlinski, T. Joachims, A Support Vector Method for Optimizing Average Precision. SIGIR 2007.
[Chart: comparison with Indri retrieval functions (Lemur project)]
7.5 Optimizing Search Diversity
Search Result Diversity
• Traditionally:
  • Assess relevance of a document on an ordinal scale (e.g. binary)
  • Quality of a ranking depends only on the sequence of relevance scores (e.g. MAP, discounted cumulative gain)
• Challenge:
  • Relevance of a document (at rank r) depends on the other documents (at ranks r' < r) = relative relevance
  • Novelty relative to the other search results matters (diversity is good!)
• Effectively one needs to look at the whole ranking and model inter-document dependencies = a collective classification problem
Diversity via Topic Coverage
• Manually annotate results with sub-topics (for a selected set of training queries)
• Prediction goal: find the top n (=10) documents (=y) that maximize topic coverage, i.e. minimize the weighted fraction of uncovered topics
• Budgeted maximum coverage problem -> (1-1/e) greedy approximation (see the sketch below)
• How to generalize to new queries? Learn a term-weighting function.
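A minimal sketch of the greedy (1-1/e) approximation for the budgeted maximum coverage step (the inputs are hypothetical: docs maps document ids to the sets of sub-topics they cover, topic_weight gives each sub-topic's weight, and n is the result-set budget):

def greedy_coverage(docs, topic_weight, n=10):
    # docs: dict doc_id -> set of sub-topics covered by that document
    selected, covered = [], set()
    for _ in range(n):
        candidates = docs.keys() - set(selected)
        if not candidates:
            break
        # pick the document that adds the largest weighted amount of uncovered topics
        best = max(candidates, key=lambda d: sum(topic_weight[t] for t in docs[d] - covered))
        selected.append(best)
        covered |= docs[best]
    return selected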
• T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. ICML 2008.
• Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. ICML 2008
Experiments
• TREC 6-8 interactive track queries
• Fine-grained topic annotations
• Comparison with Okapi, an unweighted word model, and the Essential Pages algorithm
• Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. ICML 2008
Conclusions
• Generalization of SVMs to structured classification via compatibility functions
• SVMstruct: hinge-loss-induced sparseness, column-generation approach producing a sequence of relaxations
  • efficient and highly generic approach
  • simple and theoretically well understood
  • alternative: clique-based representer theorem
• Wide range of applications
  • structured prediction problems
  • collective classification
  • many use cases in information retrieval