1
Overview of Kernel Methods (Part 2)
Steve Vincent, November 17, 2004
2
Overview
Kernel methods offer a modular framework. In a first step, a dataset is processed into a kernel matrix. Data can be of various types, and also of heterogeneous types. In a second step, a variety of kernel algorithms can be used to analyze the data, using only the information contained in the kernel matrix.
3
What will be covered today
PCA vs. Kernel PCA: algorithm, example, comparison
Text-related kernel methods: Bag of Words, semantic kernels, string kernels, tree kernels
4
PCA algorithm
1. Subtract the mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ from all the data points
2. Compute the covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$
3. Diagonalize S to get its eigenvalues and eigenvectors
4. Retain the c eigenvectors corresponding to the c largest eigenvalues such that $\sum_{j=1}^{c}\lambda_j \big/ \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured
5. Project the data points on the eigenvectors
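As a concrete illustration of steps 1-5, here is a minimal NumPy sketch (the column-per-point layout, the 95% default threshold, and the function name are illustrative assumptions, not part of the original slides):

```python
import numpy as np

def pca(X, variance_to_capture=0.95):
    """Basic PCA following steps 1-5 above. X has one data point per column (d x N)."""
    # 1. Subtract the mean from all the data points
    X_centered = X - X.mean(axis=1, keepdims=True)
    N = X.shape[1]
    # 2. Compute the covariance matrix S = (1/N) * sum_n x_n x_n^T
    S = (X_centered @ X_centered.T) / N
    # 3. Diagonalize S (symmetric, so eigh); then sort eigenvalues in descending order
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # 4. Retain the c leading eigenvectors whose eigenvalues reach the desired variance
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    c = int(np.searchsorted(ratio, variance_to_capture)) + 1
    # 5. Project the data points on the retained eigenvectors
    return eigvecs[:, :c].T @ X_centered          # c x N matrix of projections
```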
5
Kernel PCA algorithm (1)
1. Given N data points in d dimensions, let X = [x1 | x2 | ... | xN], where each column represents one data point
2. Subtract the mean from all the data points
3. Choose an appropriate kernel k
4. Form the N x N Gram matrix with entries $K_{ij} = k(x_i, x_j)$
5. Form the modified Gram matrix $\tilde{K} = \big(I - \tfrac{1}{N}\mathbf{1}_{N\times N}\big)\, K\, \big(I - \tfrac{1}{N}\mathbf{1}_{N\times N}\big)^T$, where $\mathbf{1}_{N\times N}$ is an N x N matrix with all entries equal to 1
6
Kernel PCA algorithm (2)
6. Diagonalize $\tilde{K}$ to get its eigenvalues $\lambda_n$ and its eigenvectors $a^n$
7. Normalize the eigenvectors: $\tilde{a}^n = a^n / \sqrt{\lambda_n}$
8. Retain the c eigenvectors corresponding to the c largest eigenvalues such that $\sum_{j=1}^{c}\lambda_j \big/ \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured
9. Project the data points on the eigenvectors: $y_n(x) = (\tilde{a}^n)^T \big(I - \tfrac{1}{N}\mathbf{1}_{N\times N}\big) \big[k(x_1,x), \dots, k(x_N,x)\big]^T$
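The following NumPy sketch follows steps 3-9 with an RBF kernel; the kernel width sigma, the function names, and the choice to show projections only for the training points are illustrative assumptions:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); X is d x N, one point per column."""
    sq = np.sum(X**2, axis=0)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    N = X.shape[1]
    # Steps 3-4: choose a kernel and form the N x N Gram matrix
    K = rbf_kernel_matrix(X, sigma)
    # Step 5: modified (centered) Gram matrix K~ = (I - 1/N) K (I - 1/N)^T
    C = np.eye(N) - np.ones((N, N)) / N
    K_tilde = C @ K @ C.T
    # Step 6: diagonalize K~ (symmetric), eigenvalues sorted in descending order
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Step 7: normalize the eigenvectors a^n -> a^n / sqrt(lambda_n)
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    # Step 9: projections of the training points (rows of K~ times the normalized eigenvectors)
    return K_tilde @ alphas                       # N x n_components
```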
7
Data Mining Problem
Data Source:
Computational Intelligence and Learning Cluster Challenge 2000, http://www.wi.leidenuniv.nl/~putten/library/cc2000/index.html
Supplied by Sentient Machine Research, http://www.smr.nl
Problem Definition: Given data that incorporates both socio-economic and insurance policy ownership attributes, can we derive models that help identify the factors or attributes which may influence or signify individuals who purchase a caravan insurance policy?
8
Data Selection
5,822 records for training; 4,000 records for evaluation; 86 attributes
Attributes 1 through 43: socio-demographic data derived from zip code areas
Attributes 44 through 85: product ownership for customers
Attribute 86: purchased caravan insurance
9
Data Transformation and Reduction
Principal Components Analysis (PCA) [from MATLAB]

# Attributes   % Variance      # Attributes   % Variance
25             73.29           55             98.20
30             80.66           60             98.86
35             86.53           65             99.35
40             91.25           70             99.65
45             94.94           75             99.83
50             97.26           80             99.96
10
Relative Performance
PCA run time: 6.138; Kernel PCA run time: 5.668
Used the Radial Basis Function kernel: $K(u,v) = \exp\!\big(-\|u-v\|^2 / (2\sigma^2)\big)$
MATLAB code for the PCA and Kernel PCA algorithms can be supplied if needed
11
Modeling, Test and Evaluation
Manually Reduced Dataset, Naïve Bayes: overall 82.79% correctly classified
           a      b
     a   3155    544    14.71% false positives
     b    132     98    42.61% correctly classified

PCA Reduced Dataset, Naïve Bayes: overall 88.45% correctly classified
           a      b
     a   3507    255    6.77% false positives
     b    207     31    13.03% correctly classified

Kernel PCA Reduced Dataset, Naïve Bayes: overall 82.22% correctly classified
           a      b
     a   3238    541    14.3% false positives
     b    175     74    29.7% correctly classified

* Legend: a = no, b = yes
12
Overall Results
KPCA and PCA had similar time performance
KPCA gives results much closer to those of the manually reduced dataset
Future Work:
Examine other kernels
Vary the parameters of the kernels
Use other data mining algorithms
13
'Bag of words' kernels (1)
A document is seen as a vector d, indexed by all the elements of a (controlled) dictionary; each entry is equal to the number of occurrences of the corresponding term.
A training corpus is therefore represented by a term-document matrix, denoted D = [d1 d2 ... dm-1 dm].
From this basic representation, we apply a sequence of successive embeddings, resulting in a global (valid) kernel with all the desired properties.
14
BOW kernels (2)
Properties:
All order information is lost (syntactical relationships, local context, ...)
The feature space has dimension N (size of the dictionary)
Similarity is basically defined by the dot product $k(d_1,d_2) = d_1 \cdot d_2 = d_1^T d_2$
or, normalized (cosine similarity): $\hat{k}(d_1,d_2) = \dfrac{k(d_1,d_2)}{\sqrt{k(d_1,d_1)\,k(d_2,d_2)}}$
Efficiency is provided by sparsity (and a sparse dot-product algorithm): O(|d1| + |d2|)
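A minimal sketch of the bag-of-words kernel and its cosine-normalized variant, using Python dictionaries as a sparse representation (the whitespace tokenizer and the function names are illustrative assumptions):

```python
import math
from collections import Counter

def bow(document):
    """Sparse bag-of-words vector: term -> number of occurrences."""
    return Counter(document.lower().split())

def k_bow(d1, d2):
    """Sparse dot product: iterate over the smaller vector; missing terms count as 0."""
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    return sum(count * d2[term] for term, count in d1.items())

def k_bow_normalized(d1, d2):
    """Cosine-normalized kernel: k(d1,d2) / sqrt(k(d1,d1) * k(d2,d2))."""
    return k_bow(d1, d2) / math.sqrt(k_bow(d1, d1) * k_bow(d2, d2))

d1, d2 = bow("the cat sat on the mat"), bow("the cat ate the mouse")
print(k_bow(d1, d2), k_bow_normalized(d1, d2))
```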
15
Latent concept kernels
Basic idea: documents and terms are both mapped into a shared latent concept space, and the kernel is evaluated there.
[Diagram: terms (size t) and documents (size d) are mapped into a concept space of size k << t, in which K(d1,d2) is computed.]
16
Semantic Kernels (1)
$k(d_1,d_2) = \phi(d_1)\, S S'\, \phi(d_2)'$, where S is the semantic matrix
S can be defined as S = RP, where
R is a diagonal matrix giving the term weightings or relevances
P is the proximity matrix defining the semantic spreading between the different terms of the corpus
The inverse document frequency measure for a term t is given by $w(t) = \ln\!\big(\tfrac{l}{df(t)}\big)$
where l = # of documents and df(t) = # of documents containing term t
The matrix R is diagonal with entries $R_{tt} = w(t)$
17
Semantic Kernels (2)
The kernel associated with the weighting matrix R is: $\tilde{k}(d_1,d_2) = \phi(d_1)\, R R'\, \phi(d_2)'$
For the proximity matrix P, the associated kernel is: $\tilde{k}(d_1,d_2) = \phi(d_1)\, P P'\, \phi(d_2)' = \sum_{i,j} \phi_i(d_1)\, Q_{ij}\, \phi_j(d_2)$
where $Q_{ij}$ encodes the amount of semantic relation between terms i and j.
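A small sketch of how the semantic matrix S = RP enters the kernel, using a diagonal IDF matrix R and a term proximity matrix P over a tiny vocabulary (the toy corpus, vocabulary, proximity values, and names are illustrative assumptions):

```python
import numpy as np

vocab = ["car", "automobile", "insurance"]
# Toy corpus: one row per document, one column per term (phi(d) = term counts)
docs = np.array([[2, 0, 1],     # d1: "car" twice, "insurance" once
                 [0, 3, 1]])    # d2: "automobile" three times, "insurance" once

# R: diagonal term-weighting matrix with IDF entries w(t) = ln(l / df(t))
l = docs.shape[0]
df = (docs > 0).sum(axis=0)
R = np.diag(np.log(l / df))

# P: proximity matrix encoding semantic spreading ("car" and "automobile" are related)
P = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

S = R @ P                                    # semantic matrix S = RP

def semantic_kernel(d1, d2, S):
    """k(d1, d2) = phi(d1) S S' phi(d2)'."""
    return d1 @ S @ S.T @ d2

print(semantic_kernel(docs[0], docs[1], S))
```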
18
Semantic Kernels (3)
The most natural method of incorporating semantic information is by inferring the relatedness of terms from an external source of domain knowledge. Example: the WordNet ontology
Semantic distance: path length in the hierarchical tree, or information content
19
Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD): $D = U \Sigma V'$
where $\Sigma$ is a diagonal matrix of the same dimensions as D, and U and V are unitary matrices whose columns are the eigenvectors of DD' and D'D respectively
LSI projects the documents into the space spanned by the first k columns of U, using the new k-dimensional vectors for subsequent processing: $d \mapsto \phi(d)\, U_k$
where $U_k$ is the matrix containing the first k columns of U
20
Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)
The new kernel becomes that of Kernel PCA: $\tilde{k}(d_1,d_2) = \phi(d_1)\, U_k U_k'\, \phi(d_2)'$
LSK is implemented by projecting onto the features: $\big(\phi(d)\, U_k\big)_j = \lambda_j^{-1/2} \sum_{i=1}^{\ell} (v_j)_i\, k(d_i, d), \quad j = 1, \dots, k$
where k is the base kernel, and $\lambda_i$, $v_i$ are eigenvalue, eigenvector pairs of the kernel matrix
The LSKs can be represented with the proximity matrix $P = U_k U_k'$
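A brief sketch of LSI on a term-document matrix via SVD, projecting documents onto the first k left singular vectors and computing the latent semantic kernel between them (the toy matrix and the value of k are illustrative assumptions):

```python
import numpy as np

# Toy term-document matrix D (terms are rows, documents are columns)
D = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# SVD: D = U Sigma V'
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)

k = 2
U_k = U[:, :k]                  # the first k columns of U span the latent concept space

# LSI projection of each document d: d -> phi(d) U_k (documents are the columns of D)
projected = D.T @ U_k           # one k-dimensional row per document

# Latent semantic kernel between documents i and j: phi(d_i) U_k U_k' phi(d_j)'
K_lsk = projected @ projected.T
print(K_lsk)
```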
21
String and Sequence
An alphabet $\Sigma$ is a finite set of $|\Sigma|$ symbols. A string $s = s_1 \dots s_{|s|}$ is any finite sequence of symbols from $\Sigma$, including the empty sequence.
We denote by $\Sigma^n$ the set of all finite strings of length n.
String matching implies contiguity; sequence matching only implies order.
22
p-spectrum kernel (1)
Features of s = p-spectrum of s = histogram of all (contiguous) substrings of length p
The feature space is indexed by all elements of $\Sigma^p$
$\phi_u^p(s)$ = number of occurrences of u in s
The associated kernel is defined as $k_p(s,t) = \sum_{u \in \Sigma^p} \phi_u^p(s)\, \phi_u^p(t)$
23
p-spectrum kernel example
Example: 3-spectrum kernel, with s = "statistics" and t = "computation"
The two strings contain the following substrings of length 3:
s: "sta", "tat", "ati", "tis", "ist", "sti", "tic", "ics"
t: "com", "omp", "mpu", "put", "uta", "tat", "ati", "tio", "ion"
The common substrings are "tat" and "ati", so the inner product is k(s,t) = 2
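A direct (non-recursive) sketch of the p-spectrum kernel that reproduces this example (the function name is an illustrative choice):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """k_p(s,t) = sum over length-p substrings u of (count of u in s) * (count of u in t)."""
    spec_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    spec_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(count * spec_t[u] for u, count in spec_s.items())

print(p_spectrum_kernel("statistics", "computation", 3))   # -> 2 ("tat" and "ati")
```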
24
p-spectrum Kernels: Recursion
The k-suffix kernel is defined by $k_k^S(s,t) = 1$ if $s = s_1 u$ and $t = t_1 u$ for some $u \in \Sigma^k$, and 0 otherwise
The p-spectrum kernel can be evaluated using the equation:
$k_p(s,t) = \sum_{i=1}^{|s|-p+1} \sum_{j=1}^{|t|-p+1} k_p^S\big(s(i:i+p),\, t(j:j+p)\big)$
in O(p |s| |t|) operations.
The evaluation of one row of the table for the p-suffix kernel corresponds to performing a search in the string t for the p-suffix of a prefix of s.
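One simple dynamic-programming sketch in the spirit of this table evaluation: the table stores, for each pair of positions, the length of the common suffix (capped at p), and every pair reaching p contributes one matching pair of length-p substrings (variable names are illustrative assumptions):

```python
def p_spectrum_kernel_dp(s, t, p):
    """Evaluate the p-spectrum kernel by dynamic programming over common suffixes."""
    n, m = len(s), len(t)
    # suffix[i][j] = length (capped at p) of the common suffix of s[:i] and t[:j]
    suffix = [[0] * (m + 1) for _ in range(n + 1)]
    kernel = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                suffix[i][j] = min(suffix[i - 1][j - 1] + 1, p)
                if suffix[i][j] == p:
                    kernel += 1      # s and t share a length-p substring ending at (i, j)
    return kernel

print(p_spectrum_kernel_dp("statistics", "computation", 3))   # -> 2, as in the example
```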
25
All-subsequences kernels
The feature mapping is defined by all contiguous or non-contiguous subsequences of a string
The feature space is indexed by all elements of $\Sigma^* = \{\epsilon\} \cup \Sigma \cup \Sigma^2 \cup \Sigma^3 \cup \dots$
$\phi_u(s)$ = number of occurrences of u as a (non-contiguous) subsequence of s
Explicit computation rapidly becomes infeasible (exponential in |s|, even with a sparse representation)
26
Recursive implementation
Consider the addition of one extra symbol a to s: common subsequences of (sa, t) are either in s, or must end with the symbol a (in both sa and t).
Mathematically,
$k(s, \epsilon) = 1$
$k(sa, t) = k(s, t) + \sum_{j:\, t_j = a} k\big(s,\, t(1:j-1)\big)$
This gives a complexity of $O(|s|\,|t|^2)$
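A memoized sketch of this recursion (the helper names are illustrative; the count includes the empty subsequence, matching the base case k(s, ε) = 1):

```python
from functools import lru_cache

def all_subsequences_kernel(s, t):
    """Number of common (possibly non-contiguous) subsequences of s and t, via
    k(sa, t) = k(s, t) + sum over j with t_j = a of k(s, t(1:j-1))."""
    @lru_cache(maxsize=None)
    def k(i, j):
        # common subsequences of s[:i] and t[:j]; the empty subsequence always counts
        if i == 0:
            return 1
        a = s[i - 1]
        total = k(i - 1, j)                 # subsequences that do not use s[i-1]
        for jj in range(j):                 # subsequences ending with a = s[i-1] = t[jj]
            if t[jj] == a:
                total += k(i - 1, jj)
        return total
    return k(len(s), len(t))

print(all_subsequences_kernel("gat", "cata"))   # -> 5: "", "a" (x2), "t", "at"
```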
27
Fixed-length subsequence kernels
The feature space is indexed by all elements of $\Sigma^p$
$\phi_u(s)$ = number of occurrences of the p-gram u as a (non-contiguous) subsequence of s
Recursive implementation (creates a series of p tables), as sketched after this slide:
$k_0(s, t) = 1$
$k_p(s, \epsilon) = 0$ for $p > 0$
$k_p(sa, t) = k_p(s, t) + \sum_{j:\, t_j = a} k_{p-1}\big(s,\, t(1:j-1)\big)$
Complexity: O(p |s| |t|), but we get the k-length subsequence kernels (k <= p) for free, so it is easy to compute $k(s,t) = \sum_l a_l\, k_l(s,t)$
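The same memoized style carries over to the fixed-length case by adding the length parameter to the recursion (an illustrative sketch, not the optimized p-table formulation referred to above):

```python
from functools import lru_cache

def fixed_length_subseq_kernel(s, t, p):
    """k_p(s,t): number of common (non-contiguous) subsequences of length exactly p."""
    @lru_cache(maxsize=None)
    def k(i, j, q):
        if q == 0:
            return 1                        # the empty subsequence
        if i == 0 or j == 0:
            return 0
        a = s[i - 1]
        total = k(i - 1, j, q)              # do not use s[i-1]
        for jj in range(j):                 # match s[i-1] against each t[jj] == a
            if t[jj] == a:
                total += k(i - 1, jj, q - 1)
        return total
    return k(len(s), len(t), p)

print(fixed_length_subseq_kernel("gat", "cata", 2))   # -> 1 (only "at" is common)
```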
28
Gap-weighted subsequence kernels (1)
The feature space is indexed by all elements of $\Sigma^p$
$\phi_u(s)$ = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, the weight being length-penalizing: $\lambda^{\mathrm{length}(u)}$ [NB: the length includes both matching symbols and gaps]
Example (1): the string "gon" occurs as a subsequence of the strings "gone", "going" and "galleon", but we consider the first occurrence as more important since it is contiguous, while the final occurrence is the weakest of all three
29
Gap-weighted subsequence kernels (2)
Example (2):
D1: ATCGTAGACTGTC and D2: GACTATGC
$\phi_{CAT}(D_1) = 2\lambda^8 + 2\lambda^{10}$ and $\phi_{CAT}(D_2) = \lambda^4$
$k_{CAT}(D_1, D_2) = 2\lambda^{12} + 2\lambda^{14}$
Naturally built as a dot product, hence a valid kernel
For an alphabet of size 80, there are 512,000 trigrams
For an alphabet of size 26, there are about 12 x 10^6 5-grams
30
Gap-weighted subsequence kernels (3)
It is hard to perform the explicit expansion and dot-product
There is an efficient recursive formulation (of dynamic-programming type), whose complexity is O(k |D1| |D2|)
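A sketch of one such dynamic program, in the style of the gap-weighted subsequences algorithm in Shawe-Taylor & Cristianini (ch. 11); the variable names and the test strings are illustrative assumptions:

```python
def gap_weighted_subseq_kernel(s, t, p, lam):
    """Gap-weighted subsequence kernel of length p with decay factor lam in (0, 1]."""
    n, m = len(s), len(t)
    # DPS[i][j]: weighted count of length-l common subsequences whose last matched
    # pair is (s[i-1], t[j-1]); initialized here for l = 1 (one lam per matched symbol).
    DPS = [[lam * lam if i and j and s[i - 1] == t[j - 1] else 0.0
            for j in range(m + 1)] for i in range(n + 1)]
    for l in range(2, p + 1):
        # DP[i][j]: sum over last-match positions (i', j') <= (i, j) of
        # DPS[i'][j'] * lam^(i - i') * lam^(j - j')   (gap penalties for extending)
        DP = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                DP[i][j] = (DPS[i][j] + lam * DP[i - 1][j]
                            + lam * DP[i][j - 1] - lam * lam * DP[i - 1][j - 1])
        new_DPS = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(2, n + 1):
            for j in range(2, m + 1):
                if s[i - 1] == t[j - 1]:
                    new_DPS[i][j] = lam * lam * DP[i - 1][j - 1]
        DPS = new_DPS
    return sum(DPS[i][j] for i in range(1, n + 1) for j in range(1, m + 1))

# The only common 2-gram of "gap" and "gp" is "gp" (span 3 in "gap", span 2 in "gp"):
print(gap_weighted_subseq_kernel("gap", "gp", 2, 0.5))   # -> 0.5**5 = 0.03125
```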
31
Word Sequence Kernels (1)
Here “words” are considered as symbols Meaningful symbols more relevant matching Linguistic preprocessing can be applied to improve performance Shorter sequence sizes improved computation time But increased sparsity (documents are more : “orthogonal”)
Motivation : the noisy stemming hypothesis (important N-grams approximate stems), confirmed experimentally in a categorization task
32
Word Sequence Kernels (2)
Links between Word Sequence Kernels and other methods:
For k = 1, WSK is equivalent to the basic "Bag of Words" approach
For $\lambda = 1$, there is a close relation to the polynomial kernel of degree k, but WSK takes order into account
Extensions of WSK:
Symbol-dependent decay factors (a way to introduce the IDF concept, dependence on the POS, stop words)
Different decay factors for gaps and matches (e.g. noun < adj when a gap; noun > adj when a match)
Soft matching of symbols (e.g. based on a thesaurus, or on a dictionary if we want cross-lingual kernels)
33
Tree Kernels
Applications: categorization [one doc = one tree], parsing (disambiguation) [one doc = multiple trees]
Tree kernels constitute a particular case of more general kernels defined on discrete structures (convolution kernels). Intuitively, the philosophy is to split the structured objects into parts, to define a kernel on the "atoms", and a way to recursively combine kernels over parts to get the kernel over the whole.
Feature space definition: one feature for each possible proper subtree in the training data; the feature value = number of occurrences
A subtree is defined as any part of the tree which includes more than one node, with the restriction that no "partial" rule production is allowed.
34
Trees in Text: example
[Figure: a parse tree for "John loves Mary" (S -> NP VP; VP -> V N; with "John", "loves", "Mary" as leaves), shown alongside a few among the many subtrees of this tree, such as the VP subtree covering "loves Mary" and the bare VP -> V N production.]
35
Tree Kernels: algorithm
The kernel is a dot product in this high-dimensional feature space
Once again, there is an efficient recursive algorithm (in polynomial time, not exponential!)
Basically, it compares the productions of all possible pairs of nodes (n1, n2), with n1 in T1 and n2 in T2; if the production is the same, the number of common subtrees rooted at both n1 and n2 is computed recursively, considering the number of common subtrees rooted at the common children
Formally, let $k_{\mathrm{co\text{-}rooted}}(n_1,n_2)$ = number of common subtrees rooted at both n1 and n2; then
$k(T_1, T_2) = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} k_{\mathrm{co\text{-}rooted}}(n_1, n_2)$
36
All sub-tree kernel
$k_{\mathrm{co\text{-}rooted}}(n_1,n_2) = 0$ if n1 or n2 is a leaf
$k_{\mathrm{co\text{-}rooted}}(n_1,n_2) = 0$ if n1 and n2 have different productions (or, if labeled, different labels)
Else $k_{\mathrm{co\text{-}rooted}}(n_1,n_2) = \prod_{i \in \mathrm{children}} \big(1 + k_{\mathrm{co\text{-}rooted}}(\mathrm{ch}(n_1,i), \mathrm{ch}(n_2,i))\big)$
"Production" is left intentionally ambiguous, to include both unlabelled and labeled trees
Complexity is O(|T1| . |T2|)
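A recursive sketch of this all-subtree kernel for labeled trees (the Node class and the example tree are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def nodes(root):
    """Enumerate all nodes of a tree."""
    yield root
    for child in root.children:
        yield from nodes(child)

def k_co_rooted(n1, n2):
    """Number of common subtrees rooted at both n1 and n2."""
    if not n1.children or not n2.children:           # 0 if either node is a leaf
        return 0
    prod1 = (n1.label, tuple(c.label for c in n1.children))
    prod2 = (n2.label, tuple(c.label for c in n2.children))
    if prod1 != prod2:                                # 0 if the productions differ
        return 0
    result = 1
    for c1, c2 in zip(n1.children, n2.children):      # recurse over the common children
        result *= 1 + k_co_rooted(c1, c2)
    return result

def tree_kernel(t1, t2):
    """k(T1, T2) = sum over all node pairs of k_co_rooted(n1, n2)."""
    return sum(k_co_rooted(a, b) for a in nodes(t1) for b in nodes(t2))

# Parse tree for "John loves Mary": S -> NP VP, VP -> V N
t = Node("S", [Node("NP", [Node("John")]),
               Node("VP", [Node("V", [Node("loves")]), Node("N", [Node("Mary")])])])
print(tree_kernel(t, t))   # number of matching subtree pairs of the tree with itself
```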
37
References
J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004 (Chapters 10 and 11)
J. Tian, "PCA/Kernel PCA for Image Denoising", September 16, 2004
T. Gartner, "A Survey of Kernels for Structured Data", ACM SIGKDD Explorations Newsletter, July 2003
N. Cristianini, "Latent Semantic Kernels", Proceedings of ICML-01, 18th International Conference on Machine Learning, 2001