Page 1: Semi-supervised Structured Prediction Models Ulf Brefeld Christoph Thomas Peter Tobias Stefan Alexander Büscher Gärtner Haider Scheffer Wrobel Zien Joint

Semi-supervised Structured Prediction Models

Ulf Brefeld

Joint work with Christoph Büscher, Thomas Gärtner, Peter Haider, Tobias Scheffer, Stefan Wrobel, and Alexander Zien.

Page 2:

Ulf Brefeld : “Semi-supervised Structured Prediction Models”

Binary Classification

[Figure: positive (+) and negative (-) examples separated by a hyperplane with weight vector w.]

Inappropriate for complex real-world problems.

Page 3:

Label Sequence Learning

Part-of-speech (POS) tagging:

x = “Curiosity kills the cat.” y = “noun, verb, det, noun”

Named entity recognition (NER):

x = “Tom comes from London.” y = “Person,–,–,Location”

x = “The secretion of PTH and CT...” y = “–,–,–,Gene,–,Gene,…”

Protein secondary structure prediction:

x = “XSITKTELDG ILPLVARGKV…” y = “ SS TT SS EEEE SS…”

Page 4:

Natural Language Parsing

x = “Curiosity kills the cat” y = [Figure: parse tree, not preserved.]

Classification with Taxonomies

x = [figure not preserved] y = [figure not preserved]

Page 5:

Structural Learning

Given: n labeled pairs (x_1,y_1),…,(x_n,y_n) ∈ X × Y, drawn i.i.d. according to a fixed but unknown joint distribution.

Learn a ranking function f: X × Y → R (model: [equation not preserved]). The decision value f(x,y) measures how well y fits x.

Compute prediction (inference/decoding): ŷ = argmax_y f(x,y).

Find the hypothesis that realizes the smallest regularized empirical risk:

Log-loss: kernel CRFs.

Hinge loss: M3 networks, SVMs.
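A minimal sketch of the ranking-function view of structured prediction, assuming a linear model f(x, y) = <w, Phi(x, y)>. The feature map `joint_features`, the tag set `TAGS`, and the weights are hypothetical illustrations and are not taken from the slides. Decoding is done by exhaustive search here, which is only feasible for toy inputs.

```python
from itertools import product

TAGS = ("noun", "verb", "det")  # hypothetical label alphabet

def joint_features(x, y):
    """Hypothetical joint feature map Phi(x, y) as a sparse dict of counts."""
    phi = {}
    for word, tag in zip(x, y):
        key = ("emit", word, tag)
        phi[key] = phi.get(key, 0) + 1
    for prev, cur in zip(y, y[1:]):
        key = ("trans", prev, cur)
        phi[key] = phi.get(key, 0) + 1
    return phi

def score(w, x, y):
    """Decision value f(x, y) = <w, Phi(x, y)>."""
    return sum(w.get(k, 0.0) * v for k, v in joint_features(x, y).items())

def predict(w, x):
    """Decoding: argmax over all label sequences (exponential, illustration only)."""
    return max(product(TAGS, repeat=len(x)), key=lambda y: score(w, x, y))

# Hand-set weights (assumed, not learned) for the POS example.
w = {
    ("emit", "curiosity", "noun"): 2.0,
    ("emit", "kills", "verb"): 2.0,
    ("emit", "the", "det"): 2.0,
    ("emit", "cat", "noun"): 2.0,
    ("trans", "det", "noun"): 1.0,
}
x = ("curiosity", "kills", "the", "cat")
y_hat = predict(w, x)
```

In practice the exhaustive search is replaced by a dynamic program such as Viterbi or CKY, depending on the output structure.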

Page 6:

Semi-supervised Discriminative Learning

Labeled training data is scarce and expensive. E.g., experiments in computational biology. Need for expert knowledge. Tedious and time-consuming.

Unclassified instances are abundant and cheap. Extract texts/sentences from the web (POS tagging, NER, NLP). Assess primary structure of proteins from DNA/RNA. …

There is a need for semi-supervised techniques in structural learning!

Page 7:

Overview

1. Semi-supervised learning.
   1.1. Co-regularized least squares regression.
2. Semi-supervised structured prediction models.
   2.1. Co-support vector machines.
   2.2. Transductive SVMs and efficient optimization.
3. Case study: email batch detection.
   3.1. Supervised clustering.
4. Conclusion.

Page 9:

Cluster Assumption

Now: m unlabeled inputs are given in addition to the n labeled pairs, with m >> n. The decision boundary should not cross high-density regions.

Examples: transductive learning, graph kernels, … But: the cluster assumption is frequently inappropriate, e.g., for regression! What else can we do?

[Figure: positive (+) and negative (-) examples, not preserved.]

Page 10:

Learning from Multiple Views / Co-learning

Split attributes into 2 disjoint sets (views) V1, V2. E.g., web page classification.

View 1: content of web page. View 2: anchor text of inbound links.

In each view learn a hypothesis fv, v=1,2. Each fv provides its peer with predictions on unlabeled examples. Strategy: maximize consensus between f1 and f2.

[Figure: the same list of entries (Aachen, Aalsmeer, Aaron, …, ZZ-Top) shown in two views, labeled “intrinsic” and “contextual”.]

Page 11:

Hypothesis Space Intersection

Hypothesis spaces H1 and H2. Minimize error rate and disagreement for all hypotheses in H1 ∩ H2. Unlabeled examples = data-driven regularization!

[Figure: hypothesis space and version space of view V1 and view V2, their intersection H1 ∩ H2, and the true labeling function.]

Consensus maximization principle:

Labeled examples → minimize the error. Unlabeled examples → minimize disagreement.

Minimize an upper bound on the error!

Page 12:

Co-optimization Problem

Given: n labeled pairs (x_1,y_1),…,(x_n,y_n) ∈ X × Y,

m unlabeled inputs x_{n+1},…,x_{n+m} ∈ X,

loss function Δ: Y × Y → R+,

V hypotheses (f_1,…,f_V) ∈ H_1 × … × H_V.

Goal: minimize

\[
Q(f_1,\dots,f_V) = \sum_{v=1}^{V} \Big[ \sum_{i=1}^{n} \Delta\big(y_i, \arg\max_{y'} f_v(x_i,y')\big) + \eta \lVert f_v \rVert^2 \Big] + \lambda \sum_{u,v=1}^{V} \sum_{j=n+1}^{n+m} \Delta\big(\arg\max_{y'} f_u(x_j,y'), \arg\max_{y''} f_v(x_j,y'')\big)
\]

The sum over i is the empirical risk of f_v, the η-term is the regularization, and the λ-term collects the pairwise disagreements.

Representer theorem: [equation not preserved]
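The objective Q can be evaluated directly once each view's decoder is available. The sketch below assumes each hypothesis is represented by a hypothetical function `h_v(x)` that already returns argmax_y f_v(x, y), that the squared norms ||f_v||^2 are supplied by the caller, and that Δ is the 0/1 loss. All names and the toy data are illustrative and not taken from the slides.

```python
def zero_one(y_a, y_b):
    """0/1 loss as a stand-in for the generic loss Delta."""
    return 0.0 if y_a == y_b else 1.0

def co_objective(predictors, sq_norms, labeled, unlabeled, eta, lam, loss=zero_one):
    """Q = empirical risk + eta * regularization + lam * pairwise disagreements."""
    risk = sum(loss(y, h(x)) for h in predictors for x, y in labeled)
    reg = eta * sum(sq_norms)
    # Sum over all ordered pairs (u, v), as in the double sum over u, v = 1..V.
    disagreement = sum(
        loss(h_u(x), h_v(x))
        for h_u in predictors
        for h_v in predictors
        for x in unlabeled
    )
    return risk + reg + lam * disagreement

# Two toy "views": threshold classifiers on a scalar input (assumed example).
h1 = lambda x: int(x > 0.5)
h2 = lambda x: int(x > 0.6)
labeled = [(0.2, 0), (0.9, 1)]
unlabeled = [0.4, 0.55, 0.7]
q = co_objective([h1, h2], sq_norms=[1.0, 1.0], labeled=labeled,
                 unlabeled=unlabeled, eta=0.1, lam=0.5)
```

Both views fit the labeled data, so the only data-dependent cost comes from the unlabeled point 0.55, on which they disagree.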

Page 14:

Semi-supervised Regularized Least Squares Regression

Special case: output space Y = R. Consider functions [definition not preserved].

Squared loss: Δ(y,y') = (y − y')².

Given: n labeled examples, m unlabeled inputs, V views (V kernel functions).

Consensus maximization principle: Minimize squared error for labeled examples. Minimize squared differences for unlabeled examples.

Page 15:

Co-regularized Least Squares Regression

Kernel matrix: [equation not preserved]

Optimization problem: [equation not preserved; its terms are labeled empirical risk, regularization, and disagreement]

Closed-form solution: [equation not preserved; annotated “strictly positive definite if K_v is strictly positive definite”]

Page 16:

Co-regularized Least Squares Regression

Kernel matrix: [equation not preserved]

Optimization problem: [equation not preserved; its terms are labeled empirical risk, regularization, and disagreement]

Closed-form solution: [equation not preserved]

Execution time: [expression not preserved], as good (or bad) as the state of the art.
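Since the slide's equations did not survive extraction, the following is only a sketch of how a closed-form co-regularized least squares solution can be obtained for two views. It assumes the objective sum_v (||y − L_v c_v||² + ν c_vᵀ K_v c_v) + λ ||U_1 c_1 − U_2 c_2||², where L_v and U_v are the labeled and unlabeled rows of the kernel matrix K_v and the disagreement is counted once per pair of views. The symbols `nu` and `lam`, the RBF kernel, and the toy data are assumptions. Setting the gradient to zero yields one linear system in (c_1, c_2).

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def corlsr_two_views(K1, K2, y, n, nu=0.1, lam=1.0):
    """Solve the stationarity conditions; K_v is (n+m) x (n+m), labeled points first."""
    L1, U1, L2, U2 = K1[:n], K1[n:], K2[:n], K2[n:]
    A11 = L1.T @ L1 + nu * K1 + lam * U1.T @ U1
    A22 = L2.T @ L2 + nu * K2 + lam * U2.T @ U2
    A12 = -lam * U1.T @ U2
    A = np.block([[A11, A12], [A12.T, A22]])
    b = np.concatenate([L1.T @ y, L2.T @ y])
    c = np.linalg.solve(A, b)  # cubic in n + m
    return c[: K1.shape[0]], c[K1.shape[0]:]

def objective(c1, c2, K1, K2, y, n, nu=0.1, lam=1.0):
    """Empirical risk + regularization + disagreement on the unlabeled points."""
    L1, U1, L2, U2 = K1[:n], K1[n:], K2[:n], K2[n:]
    risk = np.sum((y - L1 @ c1) ** 2) + np.sum((y - L2 @ c2) ** 2)
    reg = nu * (c1 @ K1 @ c1 + c2 @ K2 @ c2)
    disagreement = lam * np.sum((U1 @ c1 - U2 @ c2) ** 2)
    return risk + reg + disagreement

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))                # 5 labeled + 10 unlabeled points
y = np.sin(X[:5].sum(axis=1))               # targets for the labeled points only
K1 = rbf_kernel(X[:, :2], X[:, :2])         # view 1: first two attributes
K2 = rbf_kernel(X[:, 2:], X[:, 2:])         # view 2: last two attributes
c1, c2 = corlsr_two_views(K1, K2, y, n=5)
```

The linear solve over all n + m coefficients per view is what makes the exact solution cubic in the number of unlabeled examples.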

Page 18:

Semi-parametric Approximation

Restrict hypothesis space: [equation not preserved]

Convex objective function: [equation not preserved]

Solution: [equation not preserved]

Execution time: [expression not preserved], only linear in the amount of unlabeled data.

Page 19:

Semi-supervised Methods for Distributed Data

Participants keep labeled data private. Agree on fixed set of unlabeled data.

Converges to global optimum.

Page 21:

Empirical Results

32 UCI data sets, 10-fold “inverse” cross-validation. Dashed lines indicate equal performance.

RMSE: exact coRLSR < semi-parametric coRLSR < RLSR

[Figure: results for coRLSR (exact), coRLSR (approx.), and RLSR.]

Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, “Efficient CoRLSR”, ICML 2006

Page 22:

Execution Time

Exact solution is cubic in the number of unlabeled examples. Approximation only linear!

Results taken from: Brefeld, Gärtner, Scheffer, Wrobel, “Efficient CoRLSR”, ICML 2006

Page 24:

Semi-supervised Learning for Structured Output Variables

Given: n labeled examples and m unlabeled inputs.

Joint decision function: [equation not preserved], with distinct joint feature mappings in V1 and V2.

Apply consensus maximization principle. Minimize the error for labeled examples. Minimize the disagreement for unlabeled examples.

Compute argmax: Viterbi algorithm (sequential output), CKY algorithm (recursive grammar).
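For sequential outputs the argmax can be computed with the Viterbi algorithm. The sketch below assumes the decision value decomposes into per-position emission scores and label-to-label transition scores. The arrays `emission` and `transition` are hypothetical stand-ins for whatever the learned model provides, filled with random numbers here.

```python
import numpy as np

def viterbi(emission, transition):
    """Return the highest-scoring label sequence and its score.

    emission:   (T, K) array, emission[t, k] = score of label k at position t
    transition: (K, K) array, transition[a, b] = score of moving from a to b
    """
    T, K = emission.shape
    delta = emission[0].copy()            # best score of a path ending in each label
    back = np.zeros((T, K), dtype=int)    # back-pointers
    for t in range(1, T):
        cand = delta[:, None] + transition   # cand[a, b]: best path into a, then a -> b
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + emission[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

rng = np.random.default_rng(0)
emission = rng.normal(size=(5, 3))     # 5 tokens, 3 labels (toy scores)
transition = rng.normal(size=(3, 3))
path, best_score = viterbi(emission, transition)
```

The run time is O(T K²), compared with K^T for enumerating every label sequence.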

Page 25:

CoSVM Optimization Problem

View v=1,2: [optimization problem not preserved; its annotations read “confidence of peer view” and “prediction of peer view”]

Dual representation: [equation not preserved]

Dual parameters are bound to input examples. Working sets associated with subspaces. Sparse models!

Page 26:

Labeled Examples, View v=1,2

x_i = “John ate the cat”, y_i = <N,V,D,N>. α^v_{j≠i} fixed, working sets Ω^v_{j≠i} fixed. Initially Ω^v_i = {} and α^v_i = ().

Viterbi decoding: ŷ = <N,V,V,N> ≠ y_i. Error/margin violation! 1. Update working set Ω^v_i. 2. Optimize α^v_i.

Working set Ω^v_i = {φ_v(x_i,y_i) − φ_v(x_i,<N,V,V,N>)}, α^v_i = (α^v_i(<N,V,V,N>)).

Viterbi decoding: ŷ = <N,D,D,N> ≠ y_i. Error/margin violation! Repeat steps 1 and 2.

Working set Ω^v_i = {φ_v(x_i,y_i) − φ_v(x_i,<N,V,V,N>), φ_v(x_i,y_i) − φ_v(x_i,<N,D,D,N>)}, α^v_i = (α^v_i(<N,V,V,N>), α^v_i(<N,D,D,N>)).

Viterbi decoding: ŷ = <N,V,D,N> = y_i. Return α^v_i, Ω^v_i.

Page 27:

Unlabeled Examples

x_i = “John went home”. In each view v=1,2: α^v_{j≠i} fixed, working sets Ω^v_{j≠i} fixed. Initially Ω^v_i = {} and α^v_i = ().

Viterbi decoding: View 1 predicts ŷ^1 = <D,V,N>, View 2 predicts ŷ^2 = <N,V,V>. Disagreement / margin violation! 1. Update working sets Ω^1_i, Ω^2_i. 2. Optimize α^1_i, α^2_i.

View 1: working set Ω^1_i = {φ_1(x_i,<N,V,V>) − φ_1(x_i,<D,V,N>)}, α^1_i = (α^1_i(<D,V,N>)).

View 2: working set Ω^2_i = {φ_2(x_i,<D,V,N>) − φ_2(x_i,<N,V,V>)}, α^2_i = (α^2_i(<N,V,V>)).

Viterbi decoding: ŷ^1 = <N,V,N> = ŷ^2. Consensus: return α^1_i, α^2_i, Ω^1_i, Ω^2_i.

Page 28:

BioCreative Named Entity Recognition

BioCreative (Task 1A, BioCreative Challenge, 2003).

7,500 sentences from biomedical papers. Task: recognize gene/protein names. 500 holdout sentences. Approximately 350,000 features (letter n-grams, surface clues, …). Random feature split. Baseline is trained on all features.

Results taken from: Brefeld, Büscher, Scheffer, “Semi-supervised Discriminative Sequential Learning”, ECML 2005

Page 29:

BioCreative Gene/Protein Name Recognition

CoSVM more accurate than SVM. Accuracy positively correlated with number of unlabeled examples.

Results taken from: Brefeld, Büscher, Scheffer, “Semi-supervised Discriminative Sequential Learning”, ECML 2005

Page 30:

Natural Language Parsing

Wall Street Journal corpus (Penn Treebank). Subsets 2-21. 8,666 sentences of length ≤ 15 tokens. Context-free grammar contains > 4,800 production rules.

Negra corpus. German newspaper archive. 14,137 sentences of between 5 and 25 tokens. CFG contains > 26,700 production rules.

Experimental setup: Local features (rule identity, rule at border, span width, …). Loss: Δ(ya,yb) = 1 − F1(ya,yb). 100 holdout examples. CKY parser by Mark Johnson.

Results taken from: Brefeld, Scheffer, “Semi-supervised Learning for Structured Output Variables”, ICML 2006
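The loss 1 − F1 between two parses can be illustrated as follows, assuming each parse is represented as a set of labeled spans (label, start, end). That representation and the toy trees are assumptions made for this sketch; the slides do not specify how F1 is computed.

```python
def f1_loss(y_a, y_b):
    """Delta(y_a, y_b) = 1 - F1 over labeled spans; y_a is treated as the reference."""
    a, b = set(y_a), set(y_b)
    if not a and not b:
        return 0.0
    overlap = len(a & b)
    if overlap == 0:
        return 1.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 1.0 - 2 * precision * recall / (precision + recall)

# Toy parses of a four-token sentence (assumed example).
gold = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)}
pred = {("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 1, 4)}
loss = f1_loss(gold, pred)
```

Three of the four spans match, so precision and recall are both 0.75 and the loss is 0.25. Identical parses give a loss of 0.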

Page 31:

Wall Street Journal / Negra Corpus Natural Language Parsing

CoSVM significantly outperforms SVM. Adding unlabeled instances further improves F1 score.

Results taken from: Brefeld, Scheffer, “Semi-supervised Learning for Structured Output Variables”, ICML 2006

Page 32:

Execution Time

CoSVM scales quadratically in the number of unlabeled examples.

Results taken from: Brefeld, Scheffer, “Semi-supervised Learning for Structured Output Variables”, ICML 2006

Page 34:

Transductive Support Vector Machines for Structured Variables

Binary transductive SVMs: Cluster assumption. Discrete variables for unlabeled instances. Optimization is expensive even for binary tasks!

Structural transductive SVMs: Decoding = combinatorial optimization of discrete variables. Intractable!

Efficient optimization: Transform, remove discrete variables. Differentiable, continuous optimization. Apply gradient-based, unconstrained optimization techniques.

Page 35:

Unconstrained Support Vector Machines

SVM optimization problem: [equation not preserved]

Solving the constraints for the slack variables gives the unconstrained SVM: [equation not preserved]

The hinge loss is not differentiable! BUT: the Huber loss is!

Page 36:

Unconstrained Support Vector Machines

SVM optimization problem: [equation not preserved]

Solving the constraints for the slack variables gives the unconstrained SVM: [equation not preserved]

There is still a max in the objective! Substitute a differentiable softmax for the max!

Differentiable objective without constraints!
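The two smoothing steps can be sketched numerically. The exact parameterizations used on the slides are not preserved, so the quadratically smoothed hinge below (with a hypothetical width `h`) and the log-sum-exp soft maximum (with a hypothetical sharpness `rho`) are common choices assumed for illustration, not the authors' definitions.

```python
import numpy as np

def hinge(t):
    """Standard hinge loss on the margin t; not differentiable at t = 1."""
    return np.maximum(0.0, 1.0 - t)

def huber_hinge(t, h=0.5):
    """Huber-style hinge: linear for t < 1 - h, quadratic near t = 1, zero for t > 1 + h."""
    t = np.asarray(t, dtype=float)
    quad = (1.0 + h - t) ** 2 / (4.0 * h)
    return np.where(t > 1.0 + h, 0.0, np.where(t < 1.0 - h, 1.0 - t, quad))

def soft_max(scores, rho=10.0):
    """Smooth upper bound on max(scores): log(sum(exp(rho * s))) / rho."""
    s = np.asarray(scores, dtype=float)
    m = s.max()  # subtract the maximum for numerical stability
    return m + np.log(np.exp(rho * (s - m)).sum()) / rho

margins = np.array([-1.0, 1.0, 2.0])
smooth_losses = huber_hinge(margins)          # matches the hinge away from the kink
approx_max = soft_max([1.0, 2.0, 3.0], rho=50.0)
```

Away from the kink the smoothed loss agrees with the hinge loss, and as `rho` grows the soft maximum approaches the true maximum from above.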

Page 37:

Unconstrained Transductive Support Vector Machines

Unconstrained SVM objective function: [equation not preserved]

Include unlabeled instances by an appropriate loss function. Mitigate margin violations by moving w in two symmetric ways.

Unconstrained transductive SVM objective: [equation not preserved; its annotations read “overall influence of unlabeled instances” and “2-best decoder”]

Optimization problem is not convex!

Page 38:

Execution Time

Gradient-based optimization faster than solving QPs. Efficient transductive integration of unlabeled instances.

[Figure: execution time; curves labeled “+ 250 unlabeled examples” and “+ 500 unlabeled examples”.]

Results taken from: Zien, Brefeld, Scheffer, “TSVMs for Structured Variables”, ICML 2007

Page 39:

Spanish News Wire Named Entity Recognition

Spanish News Wire (Special Session of CoNLL, 2002).

3,100 sentences of between 10 and 40 tokens. Entities: person, location, organization, and misc. names (9 labels). Window of size 3 around each token. Approximately 120,000 features (token itself, surface clues, ...). 300 holdout sentences.

Results taken from: Zien, Brefeld, Scheffer, “TSVMs for Structured Variables”, ICML 2007

Page 40:

Spanish News Named Entity Recognition

TSVM has significantly lower error rates than SVMs. Error decreases as the number of unlabeled instances increases.

[Figure: token error [%] vs. number of unlabeled examples.]

Results taken from: Zien, Brefeld, Scheffer, “TSVMs for Structured Variables”, ICML 2007

Page 41:

Artificial Sequential Data

10-nearest-neighbor Laplacian kernel vs. RBF kernel. Laplacian kernel well suited. Only little improvement by TSVM, if any. Different cluster assumptions:

Laplacian: local (token level). TSVM: global (sequence level).

[Figure: panels “RBF” and “Laplacian”.]

Results taken from: Zien, Brefeld, Scheffer, “TSVMs for Structured Variables”, ICML 2007

Page 43:

Supervised Clustering of Data Streams for Email Batch Detection

Spam characteristics: Spam accounts for ~80% of electronic messages. Approximately 80-90% of this spam is generated by only a few spammers. Spammers maintain templates and exchange them rapidly. Many emails are generated from the same template (= batch) in short time frames.

Goal: Detect batches in the data stream. Ground truth of exact clusterings exists!

Batch information: Black/white listing. Improve spam/non-spam classification.

Page 44:

Template Generated Spam Messages

Dear Mr/Mrs, This is Brenda Dunn. We are accepting your mortgage application. Our office confirms you can get a $228.000 loan for a $371.00 per month payment. Follow the link to our website and submit your contact information. Best Regards, Brenda Dunn; Accounts Manager Trades/Finance Department East Office

Hello, This is Terry Hagan. We are accepting your mortgage application. Our company confirms you are legible for a $250.000 loan for a $380.00/month. Approval process will take 1 minute, so please fill out the form on our website. Best Regards, Terry Hagan; Senior Account Director Trades/Finance Department North Office

Page 45:

Correlation Clustering

Maximize intra-cluster similarity.

Parameterized similarity measure: [equation not preserved]

Solution is equivalent to poly-cut in a fully connected graph. Edge weight is similarity of the connected nodes.

Page 46:

Problem Setting

Parameterized similarity measure: [equation not preserved]

Pairwise features: edit distance of subjects, tf.idf similarity of body, …

Collection x^(i) contains T_i messages x_1^(i),…,x_{T_i}^(i).

Matrix y^(i) with y_jk^(i) = 1 if x_j^(i) and x_k^(i) are in the same cluster and 0 otherwise.

Correlation clustering is NP-complete! Solve relaxed variant instead: substitute continuous values for the binary matrix entries.
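The clustering objective can be sketched as follows, assuming the parameterized similarity is linear in the pairwise features, sim(x_j, x_k) = <w, Phi(x_j, x_k)>, and that the objective sums the similarities of all pairs placed in the same cluster. The feature tensor, the weights, and the helper names are hypothetical; the slides' own formula is not preserved.

```python
import numpy as np

def similarity_matrix(pair_features, w):
    """pair_features: (T, T, d) array of Phi(x_j, x_k); returns sim[j, k] = <w, Phi(x_j, x_k)>."""
    return pair_features @ w

def clustering_to_matrix(labels):
    """Binary matrix with entry 1 if two messages share a cluster and 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def intra_cluster_similarity(sim, y):
    """Sum of sim[j, k] * y[j, k] over all pairs j < k."""
    upper = np.triu_indices(sim.shape[0], k=1)
    return float((sim[upper] * y[upper]).sum())

# Four messages with a single pairwise feature (toy values, assumed).
raw = np.array([[0, 2, -1, -1],
                [2, 0, -1, -1],
                [-1, -1, 0, 3],
                [-1, -1, 3, 0]], dtype=float)
sim = similarity_matrix(raw[:, :, None], np.array([1.0]))
two_batches = intra_cluster_similarity(sim, clustering_to_matrix([0, 0, 1, 1]))
one_batch = intra_cluster_similarity(sim, clustering_to_matrix([0, 0, 0, 0]))
```

Because similarities may be negative, putting everything into one cluster is not optimal, so the number of clusters does not have to be fixed in advance.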

Page 47:

Large Margin Approach

Structural SVM with margin rescaling: minimize [objective not preserved] subject to: [constraints not preserved].

Replace with Lagrangian dual. Combine the minimizations.

QP with O(T^3) constraints!

Page 48:

Exploit Data Stream!

Only the latest email x_t has to be integrated into the existing clustering.

Clustering on x_1,…,x_{t-1} remains fixed.

Execution time is linear in the number of emails.

[Figure: time line with a window of emails; the cluster of the latest email is marked “?”.]

Page 49:

Sequential Approximation

Exploit streaming nature of data: [equation not preserved; it splits the objective of the clustering into a constant part and the objective of the sequential update].

Decoding strategy: Find the best cluster for the latest message or create a singleton. Computation in O(T).
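The sequential decoding strategy can be sketched like this. It assumes that the gain of adding the latest message to a cluster is the sum of its similarities to that cluster's members, and that opening a singleton has gain 0. Both conventions, the function names, and the toy similarity matrix are assumptions made for illustration.

```python
def assign_latest(sim_to_previous, clusters):
    """Pick the cluster with the largest positive summed similarity, or None for a singleton.

    sim_to_previous[j] is the similarity between the latest message and message j.
    Each earlier message is visited once, so one update costs O(T).
    """
    best, best_gain = None, 0.0
    for index, members in enumerate(clusters):
        gain = sum(sim_to_previous[j] for j in members)
        if gain > best_gain:
            best, best_gain = index, gain
    return best

def cluster_stream(sim):
    """Process messages in arrival order; earlier assignments stay fixed."""
    clusters = []
    for t in range(len(sim)):
        choice = assign_latest(sim[t], clusters)
        if choice is None:
            clusters.append([t])
        else:
            clusters[choice].append(t)
    return clusters

# Toy pairwise similarities between four messages (assumed values).
sim = [[0, 2, -1, -1],
       [2, 0, -1, -1],
       [-1, -1, 0, 3],
       [-1, -1, 3, 0]]
batches = cluster_stream(sim)
```

On this toy stream the first two messages form one batch and the last two another.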

Page 50:

Results for Batch Detection

No significant difference.

Page 51:

Execution Time

Sequential approximation is efficient.

Results taken from: Haider, Brefeld, Scheffer, “Supervised Clustering of Streaming Data”, ICML 2007

Page 52:

Supervised Clustering of Data Streams for Email Batch Detection

(P. Haider, U. Brefeld and T. Scheffer, ICML 2007)

Simple batch features increase AUC performance of spam/non-spam classification. Misclassification risk reduced by 40%!

Results taken from: Zien, Brefeld, Scheffer, “TSVMs for Structured Variables”, ICML 2007

Results taken from: Haider, Brefeld, Scheffer, “Supervised Clustering of Streaming Data”, ICML 2007

Page 54:

Conclusion

Semi-supervised learning. Consensus maximization principle vs. cluster assumption. Co-regularized Least Squares Regression.

Semi-supervised structured prediction models: CoSVMs and TSVMs. Efficient optimization.

Empirical results: Semi-supervised variants have lower error than baselines. Adding unlabeled data further improves accuracy.

Supervised Clustering: Efficient optimization. Batch features reduce misclassification risk.
