Structured SVM (Chen-Tse Tsai and Siddharth Gupta)

TRANSCRIPT

  • Slide 1: Structured SVM. Chen-Tse Tsai and Siddharth Gupta.

  • Slide 2: Outline. Introduction to SVM; Large Margin Methods for Structured and Interdependent Output Variables (Tsochantaridis et al., 2005); Max-Margin Markov Networks (Taskar et al., 2003); Learning Structural SVMs with Latent Variables (Yu and Joachims, 2009).

  • Slide 3: SVM: the main idea.

  • Slide 4: Maximum margin. Find w and b such that the margin is maximized and, for all (x_i, y_i), i = 1..n, y_i(w^T x_i + b) ≥ 1. Equivalently, find w and b such that Φ(w) = ||w||^2 = w^T w is minimized subject to the same constraints. This is a quadratic optimization problem.

  • Slide 5: Binary SVM. Training examples (x_i, y_i) with y_i ∈ {−1, +1}. Primal form: minimize (1/2)||w||^2 + C Σ_i ξ_i subject to y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0. Dual form: maximize Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.

  • Slide 6: Multiclass SVM.

  • Slide 7: Structured Output. Approach: view it as a multi-class classification task in which every complex output is one class. Problems: exponentially many classes. How to predict efficiently? How to learn efficiently? The model is potentially huge; can the number of features be kept manageable? (Slide figure: the sentence "The dog chased the cat" as the input x, with several candidate parse trees y_1, y_2, ..., y_k as outputs.)

  • Slide 8: Multi-Class SVM (Crammer & Singer, 2001). Training examples; inference predicts the class with the highest score w_y^T x; training finds the weight vectors that solve the multi-class maximum-margin optimization problem.

  • Slide 9: Multi-Class SVM (Crammer & Singer, 2001), continued. (Slide figure: the sentence "The dog chased the cat" as x, with candidate parse trees y_1, y_2, y_4, y_12, y_34, y_58, each treated as a separate class.)

  • Slide 10: Joint Feature Map. Problem: an exponential number of parameters. Instead, use a feature vector Ψ(x, y) that describes the match between x and y, and learn a single weight vector w. Inference: predict the y that maximizes the score w · Ψ(x, y). (Slide figure: the same sentence with its candidate parse trees.)

  • Slide 11: Joint Feature Map for Trees. Use a weighted context-free grammar: each rule has a weight, and the score of a tree is the sum of the weights of its rules. Find the highest-scoring tree using a CKY parser. (Slide figure: the parse tree of "The dog chased the cat".)

  • Slide 12: Structured SVM, hard margin.

  • Slide 13: Structured SVM, soft margin: the SVM_1 and SVM_2 variants.

  • Slide 14: General Loss Function. Δ(y_i, y) measures the difference between a prediction y and the true value y_i; a y with high loss should be penalized more severely. Two ways to incorporate it: slack re-scaling and margin re-scaling.

  • Slide 15: A Cutting Plane Algorithm. Only a polynomial number of constraints is needed.

  • Slide 16: A Cutting Plane Algorithm (the algorithm itself).

  • Slide 17: Computational problem. Both prediction and finding the most violated constraint require an argmax over the structured output space; when exact inference is intractable, approximate inference methods for MRFs can be used. See "Training Structural SVMs when Exact Inference is Intractable", T. Finley and T. Joachims, ICML 2008. (A Python sketch of the cutting-plane training loop follows this slide.)
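The cutting-plane training loop of Slides 15-17 can be summarized in code. The sketch below is a minimal, self-contained Python rendering of the n-slack, margin re-scaling variant, not the authors' implementation: the names psi, delta, and loss_aug_argmax are assumed, user-supplied routines (the joint feature map, the structured loss, and the most-violated-constraint oracle, e.g. a loss-augmented CKY parse for trees), and the restricted QP over the working sets is re-optimized here by plain subgradient steps rather than by a true QP solver.

```python
import numpy as np

def solve_restricted_qp(examples, working_sets, psi, delta, C, w0,
                        steps=200, lr=0.01):
    """Approximately re-optimize w over the current working sets of constraints.
    A real implementation would call a QP solver here; plain subgradient steps
    keep this sketch self-contained."""
    w = w0.copy()
    n = len(examples)
    for _ in range(steps):
        grad = w.copy()                              # gradient of 0.5 * ||w||^2
        for (x, y), ws in zip(examples, working_sets):
            if not ws:
                continue
            # most violated constraint currently in this example's working set
            ybar = max(ws, key=lambda yb: delta(y, yb) - w @ (psi(x, y) - psi(x, yb)))
            if delta(y, ybar) - w @ (psi(x, y) - psi(x, ybar)) > 0:   # hinge active
                grad -= (C / n) * (psi(x, y) - psi(x, ybar))
        w -= lr * grad
    return w


def cutting_plane_ssvm(examples, psi, delta, loss_aug_argmax,
                       C=1.0, eps=1e-3, max_outer=50):
    """n-slack cutting-plane training with margin re-scaling.

    examples                 : list of (x_i, y_i) training pairs
    psi(x, y)                : joint feature map, returns a 1-D numpy array
    delta(y, ybar)           : structured loss between the gold y and a candidate ybar
    loss_aug_argmax(w, x, y) : argmax over ybar of delta(y, ybar) + w . psi(x, ybar),
                               i.e. the most-violated-constraint oracle
    """
    w = np.zeros_like(psi(*examples[0]), dtype=float)
    working_sets = [[] for _ in examples]            # one constraint set per example
    for _ in range(max_outer):
        added = 0
        for i, (x, y) in enumerate(examples):
            ybar = loss_aug_argmax(w, x, y)
            violation = delta(y, ybar) - w @ (psi(x, y) - psi(x, ybar))
            current = max((delta(y, yb) - w @ (psi(x, y) - psi(x, yb))
                           for yb in working_sets[i]), default=0.0)
            # add the constraint only if it is violated by more than eps
            if violation > max(current, 0.0) + eps:
                working_sets[i].append(ybar)
                added += 1
                w = solve_restricted_qp(examples, working_sets, psi, delta, C, w)
        if added == 0:                               # no constraint added in a full pass
            break
    return w
```

With these pieces supplied, the loop adds each example's most violated constraint only when it is violated by more than eps and stops once a full pass adds nothing new; this is the mechanism that keeps the number of constraints polynomial, as Slide 15 claims.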
  • Slide 18: Outline. Large Margin Methods for Structured and Interdependent Output Variables (Tsochantaridis et al., 2005); Max-Margin Markov Networks (Taskar et al., 2003); Learning Structural SVMs with Latent Variables (Yu and Joachims, 2009).

  • Slide 19: Max-Margin Markov Network. The structured SVM entails a large number of constraints, handled so far by adding one constraint at a time. The M^3 network is a way to solve SVM_1 with margin re-scaling: it uses a Markov network to encode the dependencies and generate features, and it reduces the exponential number of constraints to a polynomial number.

  • Slide 20: M^3 Network as a way to generate features. Define features on the edges of the network; the slide gives the k-th feature of an instance and the loss function.

  • Slide 21: M^3 Network as a way to solve SVM_1 with margin re-scaling. Primal and dual formulations; only the node and edge marginal probabilities are needed to compute the expectation.

  • Slide 22: Polynomial-Size Reformulation: the key step. (Slide figure: a worked table over three binary variables y_0, y_1, y_2, listing all possible assignments y with their loss and feature values, the gold y, and the resulting node marginals; the individual numbers are not recoverable from this transcript.)

  • Slide 23: Polynomial-Size Reformulation: the key step. Introduce marginal dual variables and the new consistency constraints they must satisfy; the slide treats the tree-structured case.

  • Slide 24: Polynomial-Size Reformulation: the factored dual QP. The number of variables and constraints drops from N·2^M down to N(M^2 + M), where N is the number of instances and M is the length of y. Problem: if the structure is not simple, an exponential number of new constraints may be needed; enforcing only local consistency of the marginals yields an approximate result.

  • Slide 25: SMO (sequential minimal optimization). In the binary SVM there is a single linear constraint, and working-set selection picks the two variables to update; the slide gives the corresponding update for the M^3 net.

  • Slide 26: Experimental Results. Max-Margin Parsing (Taskar et al., 2004) applies the M^3 net to parsing and discusses how to extract features from a grammar.

  • Slide 27: Outline. Large Margin Methods for Structured and Interdependent Output Variables (Tsochantaridis et al., 2005); Max-Margin Markov Networks (Taskar et al., 2003); Learning Structural SVMs with Latent Variables (Yu and Joachims, 2009).

  • Slide 28: Latent Variable Models. Widely used in machine learning and statistics: unobserved quantities or missing data in experiments, dimensionality reduction. Classical examples: mixture models, PCA, LDA. This paper: latent variables in supervised prediction tasks.

  • Slide 29: Latent Variables in S-SVMs. How can we extend the structural SVM to handle latent variables?

  • Slide 30: Structured SVM.

  • Slide 31: Latent S-SVM Formulation.

  • Slide 32.

  • Slide 33: CCCP Algorithm. (A Python sketch of the CCCP alternation appears at the end of this transcript.)

  • Slide 34.

  • Slide 35: Noun Phrase Co-reference.

  • Slide 36.

  • Slide 37: Noun phrase co-reference results.

  • Slide 38.
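To close, here is the CCCP-style alternation of Slides 31-33 as a short Python sketch. It is a schematic under assumptions, not the code of Yu and Joachims: phi, delta, complete_h, and loss_aug_argmax_yh are hypothetical user-supplied routines (the joint feature map over (x, y, h), the structured loss, latent-variable completion, and loss-augmented inference over outputs and latent variables), and the convex inner problem is solved by subgradient descent instead of a full structural SVM solver.

```python
import numpy as np

def latent_ssvm_cccp(examples, phi, delta, complete_h, loss_aug_argmax_yh,
                     dim, C=1.0, n_rounds=10, inner_steps=200, lr=0.01):
    """CCCP-style alternation for a latent structural SVM.

    examples                   : list of (x_i, y_i) pairs; the latent h_i is unobserved
    phi(x, y, h)               : joint feature map, returns a 1-D numpy array of length dim
    delta(y, ybar)             : structured loss on the observed output
    complete_h(w, x, y)        : latent-variable completion, argmax over h of w . phi(x, y, h)
    loss_aug_argmax_yh(w, x, y): argmax over (ybar, hbar) of delta(y, ybar) + w . phi(x, ybar, hbar)
    """
    w = np.zeros(dim)
    n = len(examples)
    for _ in range(n_rounds):
        # Step 1 (concave part): impute the latent variables with the current model.
        completed = [(x, y, complete_h(w, x, y)) for x, y in examples]

        # Step 2 (convex part): retrain a standard structural SVM with each h_i fixed.
        # Solved here by subgradient descent on the margin re-scaled hinge objective;
        # any structural SVM solver (e.g. the cutting-plane loop sketched earlier) also works.
        for _ in range(inner_steps):
            grad = w.copy()                          # gradient of 0.5 * ||w||^2
            for x, y, h in completed:
                ybar, hbar = loss_aug_argmax_yh(w, x, y)
                margin = delta(y, ybar) - w @ (phi(x, y, h) - phi(x, ybar, hbar))
                if margin > 0:                       # hinge is active for this example
                    grad -= (C / n) * (phi(x, y, h) - phi(x, ybar, hbar))
            w -= lr * grad
    return w
```

Each round first imputes the latent variables with the current model (the concave part of the objective) and then retrains a standard structural SVM with those imputed values held fixed (the convex part), which is the alternation CCCP prescribes for this difference-of-convex objective.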