1
Overview of Kernel Methods (Part 2)
Steve Vincent, November 17, 2004
2
Overview
Kernel methods offer a modular framework. In a first step, a dataset is processed into a kernel matrix. Data can be of various types, and also of heterogeneous types. In a second step, a variety of kernel algorithms can be used to analyze the data, using only the information contained in the kernel matrix.
3
What will be covered today
PCA vs. Kernel PCA: algorithm, example, comparison
Text-related kernel methods: Bag of Words, semantic kernels, string kernels, tree kernels
4
PCA algorithm
1. Subtract the mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ from all the data points
2. Compute the covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$
3. Diagonalize S to get its eigenvalues and eigenvectors
4. Retain the c eigenvectors corresponding to the c largest eigenvalues such that $\sum_{j=1}^{c}\lambda_j \big/ \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured
5. Project the data points on the eigenvectors
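As a concrete illustration of steps 1-5, here is a minimal NumPy sketch (the column-per-point layout, the 95% default threshold, and the function name are illustrative assumptions, not part of the original slides):

```python
import numpy as np

def pca(X, variance_to_capture=0.95):
    """Basic PCA following steps 1-5 above. X has one data point per column (d x N)."""
    # 1. Subtract the mean from all the data points
    X_centered = X - X.mean(axis=1, keepdims=True)
    N = X.shape[1]
    # 2. Compute the covariance matrix S = (1/N) * sum_n x_n x_n^T
    S = (X_centered @ X_centered.T) / N
    # 3. Diagonalize S (symmetric, so eigh); then sort eigenvalues in descending order
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # 4. Retain the c leading eigenvectors whose eigenvalues reach the desired variance
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    c = int(np.searchsorted(ratio, variance_to_capture)) + 1
    # 5. Project the data points on the retained eigenvectors
    return eigvecs[:, :c].T @ X_centered          # c x N matrix of projections
```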
5
Kernel PCA algorithm (1)
1. Given N data points in d dimensions, let X = [x1 | x2 | ... | xN], where each column represents one data point
2. Subtract the mean from all the data points
3. Choose an appropriate kernel k
4. Form the N x N Gram matrix with entries $K_{ij} = k(x_i, x_j)$
5. Form the modified Gram matrix $\tilde{K} = \big(I - \tfrac{1}{N}\mathbf{1}_{N\times N}\big)\, K\, \big(I - \tfrac{1}{N}\mathbf{1}_{N\times N}\big)^T$, where $\mathbf{1}_{N\times N}$ is an N x N matrix with all entries equal to 1
6
Kernel PCA algorithm (2)
6. Diagonalize $\tilde{K}$ to get its eigenvalues $\lambda_n$ and its eigenvectors $a^n$
7. Normalize the eigenvectors: $\tilde{a}^n = a^n / \sqrt{\lambda_n}$
8. Retain the c eigenvectors corresponding to the c largest eigenvalues such that $\sum_{j=1}^{c}\lambda_j \big/ \sum_{j=1}^{N}\lambda_j$ equals the desired variance to be captured
9. Project the data points on the eigenvectors: $y_n(x) = (\tilde{a}^n)^T \big(I - \tfrac{1}{N}\mathbf{1}_{N\times N}\big) \big[k(x_1,x), \dots, k(x_N,x)\big]^T$
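The following NumPy sketch follows steps 3-9 with an RBF kernel; the kernel width sigma, the function names, and the choice to show projections only for the training points are illustrative assumptions:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); X is d x N, one point per column."""
    sq = np.sum(X**2, axis=0)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    N = X.shape[1]
    # Steps 3-4: choose a kernel and form the N x N Gram matrix
    K = rbf_kernel_matrix(X, sigma)
    # Step 5: modified (centered) Gram matrix K~ = (I - 1/N) K (I - 1/N)^T
    C = np.eye(N) - np.ones((N, N)) / N
    K_tilde = C @ K @ C.T
    # Step 6: diagonalize K~ (symmetric), eigenvalues sorted in descending order
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Step 7: normalize the eigenvectors a^n -> a^n / sqrt(lambda_n)
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    # Step 9: projections of the training points (rows of K~ times the normalized eigenvectors)
    return K_tilde @ alphas                       # N x n_components
```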
7
Data Mining Problem
Data Source:
Computational Intelligence and Learning Cluster Challenge 2000, http://www.wi.leidenuniv.nl/~putten/library/cc2000/index.html
Supplied by Sentient Machine Research, http://www.smr.nl
Problem Definition: Given data that incorporates both socio-economic and insurance policy ownership attributes, can we derive models that help identify the factors or attributes which may influence or signify individuals who purchase a caravan insurance policy?
8
Data Selection
5,822 records for training; 4,000 records for evaluation; 86 attributes
Attributes 1 through 43: socio-demographic data derived from zip code areas
Attributes 44 through 85: product ownership for customers
Attribute 86: purchased caravan insurance
9
Data Transformation and Reduction
Principal Components Analysis (PCA) [from MATLAB]

# Attributes   % Variance      # Attributes   % Variance
25             73.29           55             98.20
30             80.66           60             98.86
35             86.53           65             99.35
40             91.25           70             99.65
45             94.94           75             99.83
50             97.26           80             99.96
10
Relative Performance
PCA run time: 6.138; Kernel PCA run time: 5.668
Used the Radial Basis Function kernel: $K(u,v) = \exp\!\big(-\|u-v\|^2 / (2\sigma^2)\big)$
MATLAB code for the PCA and Kernel PCA algorithms can be supplied if needed
11
Modeling, Test and Evaluation
Manually Reduced Dataset, Naïve Bayes: overall 82.79% correctly classified
           a      b
     a   3155    544    14.71% false positives
     b    132     98    42.61% correctly classified

PCA Reduced Dataset, Naïve Bayes: overall 88.45% correctly classified
           a      b
     a   3507    255    6.77% false positives
     b    207     31    13.03% correctly classified

Kernel PCA Reduced Dataset, Naïve Bayes: overall 82.22% correctly classified
           a      b
     a   3238    541    14.3% false positives
     b    175     74    29.7% correctly classified

* Legend: a = no, b = yes
12
Overall Results
KPCA and PCA had similar time performance
KPCA gives results much closer to those of the manually reduced dataset
Future Work:
Examine other kernels
Vary the parameters of the kernels
Use other data mining algorithms
13
'Bag of words' kernels (1)
A document is seen as a vector d, indexed by all the elements of a (controlled) dictionary; each entry is equal to the number of occurrences of the corresponding term.
A training corpus is therefore represented by a term-document matrix, denoted D = [d1 d2 ... dm-1 dm].
From this basic representation, we apply a sequence of successive embeddings, resulting in a global (valid) kernel with all the desired properties.
14
BOW kernels (2)
Properties:
All order information is lost (syntactical relationships, local context, ...)
The feature space has dimension N (size of the dictionary)
Similarity is basically defined by the dot product $k(d_1,d_2) = d_1 \cdot d_2 = d_1^T d_2$
or, normalized (cosine similarity): $\hat{k}(d_1,d_2) = \dfrac{k(d_1,d_2)}{\sqrt{k(d_1,d_1)\,k(d_2,d_2)}}$
Efficiency is provided by sparsity (and a sparse dot-product algorithm): O(|d1| + |d2|)
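A minimal sketch of the bag-of-words kernel and its cosine-normalized variant, using Python dictionaries as a sparse representation (the whitespace tokenizer and the function names are illustrative assumptions):

```python
import math
from collections import Counter

def bow(document):
    """Sparse bag-of-words vector: term -> number of occurrences."""
    return Counter(document.lower().split())

def k_bow(d1, d2):
    """Sparse dot product: iterate over the smaller vector; missing terms count as 0."""
    if len(d1) > len(d2):
        d1, d2 = d2, d1
    return sum(count * d2[term] for term, count in d1.items())

def k_bow_normalized(d1, d2):
    """Cosine-normalized kernel: k(d1,d2) / sqrt(k(d1,d1) * k(d2,d2))."""
    return k_bow(d1, d2) / math.sqrt(k_bow(d1, d1) * k_bow(d2, d2))

d1, d2 = bow("the cat sat on the mat"), bow("the cat ate the mouse")
print(k_bow(d1, d2), k_bow_normalized(d1, d2))
```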
15
Latent concept kernels
Basic idea: documents and terms are both mapped into a shared latent concept space, and the kernel is evaluated there.
[Diagram: terms (size t) and documents (size d) are mapped into a concept space of size k << t, in which K(d1,d2) is computed.]
16
Semantic Kernels (1)
$k(d_1,d_2) = \phi(d_1)\, S S'\, \phi(d_2)'$, where S is the semantic matrix
S can be defined as S = RP, where
R is a diagonal matrix giving the term weightings or relevances
P is the proximity matrix defining the semantic spreading between the different terms of the corpus
The inverse document frequency measure for a term t is given by $w(t) = \ln\!\big(\tfrac{l}{df(t)}\big)$
where l = # of documents and df(t) = # of documents containing term t
The matrix R is diagonal with entries $R_{tt} = w(t)$
17
Semantic Kernels (2)
The kernel associated with the weighting matrix R is: $\tilde{k}(d_1,d_2) = \phi(d_1)\, R R'\, \phi(d_2)'$
For the proximity matrix P, the associated kernel is: $\tilde{k}(d_1,d_2) = \phi(d_1)\, P P'\, \phi(d_2)' = \sum_{i,j} \phi_i(d_1)\, Q_{ij}\, \phi_j(d_2)$
where $Q_{ij}$ encodes the amount of semantic relation between terms i and j.
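A small sketch of how the semantic matrix S = RP enters the kernel, using a diagonal IDF matrix R and a term proximity matrix P over a tiny vocabulary (the toy corpus, vocabulary, proximity values, and names are illustrative assumptions):

```python
import numpy as np

vocab = ["car", "automobile", "insurance"]
# Toy corpus: one row per document, one column per term (phi(d) = term counts)
docs = np.array([[2, 0, 1],     # d1: "car" twice, "insurance" once
                 [0, 3, 1]])    # d2: "automobile" three times, "insurance" once

# R: diagonal term-weighting matrix with IDF entries w(t) = ln(l / df(t))
l = docs.shape[0]
df = (docs > 0).sum(axis=0)
R = np.diag(np.log(l / df))

# P: proximity matrix encoding semantic spreading ("car" and "automobile" are related)
P = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

S = R @ P                                    # semantic matrix S = RP

def semantic_kernel(d1, d2, S):
    """k(d1, d2) = phi(d1) S S' phi(d2)'."""
    return d1 @ S @ S.T @ d2

print(semantic_kernel(docs[0], docs[1], S))
```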
18
Semantic Kernels (3)
The most natural method of incorporating semantic information is by inferring the relatedness of terms from an external source of domain knowledge. Example: the WordNet ontology
Semantic distance: path length in the hierarchical tree, or information content
19
Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD): $D = U \Sigma V'$
where $\Sigma$ is a diagonal matrix of the same dimensions as D, and U and V are unitary matrices whose columns are the eigenvectors of DD' and D'D respectively
LSI projects the documents into the space spanned by the first k columns of U, using the new k-dimensional vectors for subsequent processing: $d \mapsto \phi(d)\, U_k$
where $U_k$ is the matrix containing the first k columns of U
20
Latent Semantic Kernels (LSK) / Latent Semantic Indexing (LSI)
The new kernel becomes that of Kernel PCA: $\tilde{k}(d_1,d_2) = \phi(d_1)\, U_k U_k'\, \phi(d_2)'$
LSK is implemented by projecting onto the features: $\big(\phi(d)\, U_k\big)_j = \lambda_j^{-1/2} \sum_{i=1}^{\ell} (v_j)_i\, k(d_i, d), \quad j = 1, \dots, k$
where k is the base kernel, and $\lambda_i$, $v_i$ are eigenvalue, eigenvector pairs of the kernel matrix
The LSKs can be represented with the proximity matrix $P = U_k U_k'$
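A brief sketch of LSI on a term-document matrix via SVD, projecting documents onto the first k left singular vectors and computing the latent semantic kernel between them (the toy matrix and the value of k are illustrative assumptions):

```python
import numpy as np

# Toy term-document matrix D (terms are rows, documents are columns)
D = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

# SVD: D = U Sigma V'
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)

k = 2
U_k = U[:, :k]                  # the first k columns of U span the latent concept space

# LSI projection of each document d: d -> phi(d) U_k (documents are the columns of D)
projected = D.T @ U_k           # one k-dimensional row per document

# Latent semantic kernel between documents i and j: phi(d_i) U_k U_k' phi(d_j)'
K_lsk = projected @ projected.T
print(K_lsk)
```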
21
String and Sequence
An alphabet $\Sigma$ is a finite set of $|\Sigma|$ symbols. A string $s = s_1 \dots s_{|s|}$ is any finite sequence of symbols from $\Sigma$, including the empty sequence.
We denote by $\Sigma^n$ the set of all finite strings of length n.
String matching implies contiguity; sequence matching only implies order.
22
p-spectrum kernel (1)
Features of s = p-spectrum of s = histogram of all (contiguous) substrings of length p
The feature space is indexed by all elements of $\Sigma^p$
$\phi_u^p(s)$ = number of occurrences of u in s
The associated kernel is defined as $k_p(s,t) = \sum_{u \in \Sigma^p} \phi_u^p(s)\, \phi_u^p(t)$
23
p-spectrum kernel example
Example: 3-spectrum kernel, with s = "statistics" and t = "computation"
The two strings contain the following substrings of length 3:
s: "sta", "tat", "ati", "tis", "ist", "sti", "tic", "ics"
t: "com", "omp", "mpu", "put", "uta", "tat", "ati", "tio", "ion"
The common substrings are "tat" and "ati", so the inner product is k(s,t) = 2
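A direct (non-recursive) sketch of the p-spectrum kernel that reproduces this example (the function name is an illustrative choice):

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """k_p(s,t) = sum over length-p substrings u of (count of u in s) * (count of u in t)."""
    spec_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    spec_t = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(count * spec_t[u] for u, count in spec_s.items())

print(p_spectrum_kernel("statistics", "computation", 3))   # -> 2 ("tat" and "ati")
```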
24
p-spectrum Kernels: Recursion
The k-suffix kernel is defined by $k_k^S(s,t) = 1$ if $s = s_1 u$ and $t = t_1 u$ for some $u \in \Sigma^k$, and 0 otherwise
The p-spectrum kernel can be evaluated using the equation:
$k_p(s,t) = \sum_{i=1}^{|s|-p+1} \sum_{j=1}^{|t|-p+1} k_p^S\big(s(i:i+p),\, t(j:j+p)\big)$
in O(p |s| |t|) operations.
The evaluation of one row of the table for the p-suffix kernel corresponds to performing a search in the string t for the p-suffix of a prefix of s.
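One simple dynamic-programming sketch in the spirit of this table evaluation: the table stores, for each pair of positions, the length of the common suffix (capped at p), and every pair reaching p contributes one matching pair of length-p substrings (variable names are illustrative assumptions):

```python
def p_spectrum_kernel_dp(s, t, p):
    """Evaluate the p-spectrum kernel by dynamic programming over common suffixes."""
    n, m = len(s), len(t)
    # suffix[i][j] = length (capped at p) of the common suffix of s[:i] and t[:j]
    suffix = [[0] * (m + 1) for _ in range(n + 1)]
    kernel = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                suffix[i][j] = min(suffix[i - 1][j - 1] + 1, p)
                if suffix[i][j] == p:
                    kernel += 1      # s and t share a length-p substring ending at (i, j)
    return kernel

print(p_spectrum_kernel_dp("statistics", "computation", 3))   # -> 2, as in the example
```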
25
All-subsequences kernels
The feature mapping is defined by all contiguous or non-contiguous subsequences of a string
The feature space is indexed by all elements of $\Sigma^* = \{\epsilon\} \cup \Sigma \cup \Sigma^2 \cup \Sigma^3 \cup \dots$
$\phi_u(s)$ = number of occurrences of u as a (non-contiguous) subsequence of s
Explicit computation rapidly becomes infeasible (exponential in |s|, even with a sparse representation)
26
Recursive implementation
Consider the addition of one extra symbol a to s: common subsequences of (sa, t) are either in s, or must end with the symbol a (in both sa and t).
Mathematically,
$k(s, \epsilon) = 1$
$k(sa, t) = k(s, t) + \sum_{j:\, t_j = a} k\big(s,\, t(1:j-1)\big)$
This gives a complexity of $O(|s|\,|t|^2)$
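A memoized sketch of this recursion (the helper names are illustrative; the count includes the empty subsequence, matching the base case k(s, ε) = 1):

```python
from functools import lru_cache

def all_subsequences_kernel(s, t):
    """Number of common (possibly non-contiguous) subsequences of s and t, via
    k(sa, t) = k(s, t) + sum over j with t_j = a of k(s, t(1:j-1))."""
    @lru_cache(maxsize=None)
    def k(i, j):
        # common subsequences of s[:i] and t[:j]; the empty subsequence always counts
        if i == 0:
            return 1
        a = s[i - 1]
        total = k(i - 1, j)                 # subsequences that do not use s[i-1]
        for jj in range(j):                 # subsequences ending with a = s[i-1] = t[jj]
            if t[jj] == a:
                total += k(i - 1, jj)
        return total
    return k(len(s), len(t))

print(all_subsequences_kernel("gat", "cata"))   # -> 5: "", "a" (x2), "t", "at"
```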
27
Fixed-length subsequence kernels
The feature space is indexed by all elements of $\Sigma^p$
$\phi_u(s)$ = number of occurrences of the p-gram u as a (non-contiguous) subsequence of s
Recursive implementation (creates a series of p tables), as sketched after this slide:
$k_0(s, t) = 1$
$k_p(s, \epsilon) = 0$ for $p > 0$
$k_p(sa, t) = k_p(s, t) + \sum_{j:\, t_j = a} k_{p-1}\big(s,\, t(1:j-1)\big)$
Complexity: O(p |s| |t|), but we get the k-length subsequence kernels (k <= p) for free, so it is easy to compute $k(s,t) = \sum_l a_l\, k_l(s,t)$
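The same memoized style carries over to the fixed-length case by adding the length parameter to the recursion (an illustrative sketch, not the optimized p-table formulation referred to above):

```python
from functools import lru_cache

def fixed_length_subseq_kernel(s, t, p):
    """k_p(s,t): number of common (non-contiguous) subsequences of length exactly p."""
    @lru_cache(maxsize=None)
    def k(i, j, q):
        if q == 0:
            return 1                        # the empty subsequence
        if i == 0 or j == 0:
            return 0
        a = s[i - 1]
        total = k(i - 1, j, q)              # do not use s[i-1]
        for jj in range(j):                 # match s[i-1] against each t[jj] == a
            if t[jj] == a:
                total += k(i - 1, jj, q - 1)
        return total
    return k(len(s), len(t), p)

print(fixed_length_subseq_kernel("gat", "cata", 2))   # -> 1 (only "at" is common)
```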
28
Gap-weighted subsequence kernels (1)
The feature space is indexed by all elements of $\Sigma^p$
$\phi_u(s)$ = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, the weight being length-penalizing: $\lambda^{\mathrm{length}(u)}$ [NB: the length includes both matching symbols and gaps]
Example (1): the string "gon" occurs as a subsequence of the strings "gone", "going" and "galleon", but we consider the first occurrence as more important since it is contiguous, while the final occurrence is the weakest of all three
29
Gap-weighted subsequence kernels (2)
Example (2):
D1: ATCGTAGACTGTC and D2: GACTATGC
$\phi_{CAT}(D_1) = 2\lambda^8 + 2\lambda^{10}$ and $\phi_{CAT}(D_2) = \lambda^4$
$k_{CAT}(D_1, D_2) = 2\lambda^{12} + 2\lambda^{14}$
Naturally built as a dot product, hence a valid kernel
For an alphabet of size 80, there are 512,000 trigrams
For an alphabet of size 26, there are about 12 x 10^6 5-grams
30
Gap-weighted subsequence kernels (3)
It is hard to perform the explicit expansion and dot-product
There is an efficient recursive formulation (of dynamic-programming type), whose complexity is O(k |D1| |D2|)
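A sketch of one such dynamic program, in the style of the gap-weighted subsequences algorithm in Shawe-Taylor & Cristianini (ch. 11); the variable names and the test strings are illustrative assumptions:

```python
def gap_weighted_subseq_kernel(s, t, p, lam):
    """Gap-weighted subsequence kernel of length p with decay factor lam in (0, 1]."""
    n, m = len(s), len(t)
    # DPS[i][j]: weighted count of length-l common subsequences whose last matched
    # pair is (s[i-1], t[j-1]); initialized here for l = 1 (one lam per matched symbol).
    DPS = [[lam * lam if i and j and s[i - 1] == t[j - 1] else 0.0
            for j in range(m + 1)] for i in range(n + 1)]
    for l in range(2, p + 1):
        # DP[i][j]: sum over last-match positions (i', j') <= (i, j) of
        # DPS[i'][j'] * lam^(i - i') * lam^(j - j')   (gap penalties for extending)
        DP = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                DP[i][j] = (DPS[i][j] + lam * DP[i - 1][j]
                            + lam * DP[i][j - 1] - lam * lam * DP[i - 1][j - 1])
        new_DPS = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(2, n + 1):
            for j in range(2, m + 1):
                if s[i - 1] == t[j - 1]:
                    new_DPS[i][j] = lam * lam * DP[i - 1][j - 1]
        DPS = new_DPS
    return sum(DPS[i][j] for i in range(1, n + 1) for j in range(1, m + 1))

# The only common 2-gram of "gap" and "gp" is "gp" (span 3 in "gap", span 2 in "gp"):
print(gap_weighted_subseq_kernel("gap", "gp", 2, 0.5))   # -> 0.5**5 = 0.03125
```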
31
Word Sequence Kernels (1)
Here “words” are considered as symbols Meaningful symbols more relevant matching Linguistic preprocessing can be applied to improve performance Shorter sequence sizes improved computation time But increased sparsity (documents are more : “orthogonal”)
Motivation : the noisy stemming hypothesis (important N-grams approximate stems), confirmed experimentally in a categorization task
32
Word Sequence Kernels (2)
Links between Word Sequence Kernels and other methods:
For k = 1, WSK is equivalent to the basic "Bag of Words" approach
For $\lambda = 1$, there is a close relation to the polynomial kernel of degree k, but WSK takes order into account
Extensions of WSK:
Symbol-dependent decay factors (a way to introduce the IDF concept, dependence on the POS, stop words)
Different decay factors for gaps and matches (e.g. noun < adj when a gap; noun > adj when a match)
Soft matching of symbols (e.g. based on a thesaurus, or on a dictionary if we want cross-lingual kernels)
33
Tree Kernels
Applications: categorization [one doc = one tree], parsing (disambiguation) [one doc = multiple trees]
Tree kernels constitute a particular case of more general kernels defined on discrete structures (convolution kernels). Intuitively, the philosophy is to split the structured objects into parts, to define a kernel on the "atoms", and a way to recursively combine kernels over parts to get the kernel over the whole.
Feature space definition: one feature for each possible proper subtree in the training data; the feature value = number of occurrences
A subtree is defined as any part of the tree which includes more than one node, with the restriction that no "partial" rule production is allowed.
34
Trees in Text: example
[Figure: a parse tree for "John loves Mary" (S -> NP VP; VP -> V N; with "John", "loves", "Mary" as leaves), shown alongside a few among the many subtrees of this tree, such as the VP subtree covering "loves Mary" and the bare VP -> V N production.]
35
Tree Kernels: algorithm
The kernel is a dot product in this high-dimensional feature space
Once again, there is an efficient recursive algorithm (in polynomial time, not exponential!)
Basically, it compares the productions of all possible pairs of nodes (n1, n2), with n1 in T1 and n2 in T2; if the production is the same, the number of common subtrees rooted at both n1 and n2 is computed recursively, considering the number of common subtrees rooted at the common children
Formally, let $k_{\mathrm{co\text{-}rooted}}(n_1,n_2)$ = number of common subtrees rooted at both n1 and n2; then
$k(T_1, T_2) = \sum_{n_1 \in T_1} \sum_{n_2 \in T_2} k_{\mathrm{co\text{-}rooted}}(n_1, n_2)$
36
All sub-tree kernel
$k_{\mathrm{co\text{-}rooted}}(n_1,n_2) = 0$ if n1 or n2 is a leaf
$k_{\mathrm{co\text{-}rooted}}(n_1,n_2) = 0$ if n1 and n2 have different productions (or, if labeled, different labels)
Else $k_{\mathrm{co\text{-}rooted}}(n_1,n_2) = \prod_{i \in \mathrm{children}} \big(1 + k_{\mathrm{co\text{-}rooted}}(\mathrm{ch}(n_1,i), \mathrm{ch}(n_2,i))\big)$
"Production" is left intentionally ambiguous, to include both unlabelled and labeled trees
Complexity is O(|T1| . |T2|)
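A recursive sketch of this all-subtree kernel for labeled trees (the Node class and the example tree are illustrative assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def nodes(root):
    """Enumerate all nodes of a tree."""
    yield root
    for child in root.children:
        yield from nodes(child)

def k_co_rooted(n1, n2):
    """Number of common subtrees rooted at both n1 and n2."""
    if not n1.children or not n2.children:           # 0 if either node is a leaf
        return 0
    prod1 = (n1.label, tuple(c.label for c in n1.children))
    prod2 = (n2.label, tuple(c.label for c in n2.children))
    if prod1 != prod2:                                # 0 if the productions differ
        return 0
    result = 1
    for c1, c2 in zip(n1.children, n2.children):      # recurse over the common children
        result *= 1 + k_co_rooted(c1, c2)
    return result

def tree_kernel(t1, t2):
    """k(T1, T2) = sum over all node pairs of k_co_rooted(n1, n2)."""
    return sum(k_co_rooted(a, b) for a in nodes(t1) for b in nodes(t2))

# Parse tree for "John loves Mary": S -> NP VP, VP -> V N
t = Node("S", [Node("NP", [Node("John")]),
               Node("VP", [Node("V", [Node("loves")]), Node("N", [Node("Mary")])])])
print(tree_kernel(t, t))   # number of matching subtree pairs of the tree with itself
```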
37
References
J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, 2004 (Chapters 10 and 11)
J. Tian, "PCA/Kernel PCA for Image Denoising", September 16, 2004
T. Gartner, "A Survey of Kernels for Structured Data", ACM SIGKDD Explorations Newsletter, July 2003
N. Cristianini, "Latent Semantic Kernels", Proceedings of ICML-01, 18th International Conference on Machine Learning, 2001