string kernels
-
7/28/2019 String Kernels
1/38
S.V.N. Vishy Vishwanathan: String and Tree Kernels, Page 1
String and Tree Kernels: Algorithms and Applications
S.V.N. Vishy Vishwanathan
National ICT Australia and
Indian Institute of Science
Bangalore, India
Joint work with Alex Smola
-
Overview - I
Motivation
Exact Kernels on Strings
Definition and Examples
Suffix Trees
Definition
Matching Statistics
Counting Substrings
Weights and Kernels
Annotation
Weighting Functions
Linear Time Prediction
-
Overview - II
Sliding Windows
Position Dependent Weights
Kernels on Trees
Definition
Sorting Trees
Tree to String Conversion
Coarsening Levels
Inexact Kernels on Strings
Definition
Mismatch Kernel
Space Time Tradeoffs
Extensions and Future Work
-
Motivation
Kernel Methods
Exciting theoretical bounds
Numerous applications
Limited to vectorial data
Strings
Bio-informatics
Spam filtering
Internet search engines
String Kernels
Must respect structure
Fast and easy to compute
Semantically meaningful
-
Notation
Alphabet
Set of characters denoted by $\mathcal{A}$
String
Any $x \in \mathcal{A}^k$ for some $k$
Sentinel
Some $\$ \notin \mathcal{A}$ used to terminate a string
Prefix/Suffix/Substring
Let $x = uvw$ for some possibly empty $u$, $v$ and $w$
$u$ is called the prefix and $w$ the suffix of $x$
$v$ is called a substring of $x$
We write $v \sqsubseteq x$
-
String Kernels - I
Exact Matching Kernels
$k(x, x') := \sum_{s \sqsubseteq x,\, s' \sqsubseteq x'} w_s \delta_{s,s'} = \sum_{s \in \mathcal{A}^*} \mathrm{num}_s(x)\, \mathrm{num}_s(x')\, w_s.$
Count all matching substrings
Flexible weighting scheme
Different applications ⇒ different weights
Noise in training data ⇒ does not work :-(
Successful applications in bio-informatics (Vishwanathan and Smola, 2002) (Leslie et al., 2002)
Linear time algorithms using suffix trees
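The kernel definition above can be checked on small inputs with a brute-force sketch. It is not the linear-time suffix-tree algorithm from the talk; it simply enumerates every substring, and the names `substring_counts`, `exact_kernel` and the `weight` callback are assumptions made for illustration.

```python
from collections import Counter


def substring_counts(x):
    """Count every non-empty substring of x (there are O(|x|^2) of them)."""
    return Counter(x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1))


def exact_kernel(x, y, weight=lambda s: 1.0):
    """k(x, y) = sum over shared substrings s of num_s(x) * num_s(y) * w_s."""
    cx, cy = substring_counts(x), substring_counts(y)
    # Only substrings present in both strings contribute a non-zero term.
    return sum(cx[s] * cy[s] * weight(s) for s in cx.keys() & cy.keys())


if __name__ == "__main__":
    print(exact_kernel("abba", "ababc"))
```

With unit weights, the strings abba and ababc share a (2 x 2), b (2 x 2), ab (1 x 2) and ba (1 x 1), so the value is 11.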
-
Exact String Kernels
Bag of Characters
Counts single characters (Joachims, 1999). Set $w_s = 0$ for all $|s| > 1$
Bag of Words
s is bounded by whitespace (Joachims, 1999)
Limited Range Correlations
Set $w_s = 0$ for all $|s| > n$ given a fixed $n$
K-spectrum kernel
Account for matching substrings of length $k$ (Leslie et al., 2002). Set $w_s = 0$ for all $|s| \neq k$
General Case
Quadratic time kernel computation (Haussler, 1998; Watkins, 1998), cubic time prediction
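The special cases above differ only in which substring lengths receive non-zero weight. A minimal sketch, assuming unit weight on the retained lengths (the function names are illustrative, not from the talk):

```python
from collections import Counter


def kmer_counts(x, k):
    """Count the substrings of x that have length exactly k."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def spectrum_kernel(x, y, k):
    """K-spectrum kernel: w_s = 1 if |s| == k, else 0."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[s] * cy[s] for s in cx.keys() & cy.keys())


def bag_of_characters_kernel(x, y):
    """Bag of characters: only single characters count (k = 1)."""
    return spectrum_kernel(x, y, 1)


def limited_range_kernel(x, y, n):
    """Limited range correlations: w_s = 1 for |s| <= n, else 0."""
    return sum(spectrum_kernel(x, y, k) for k in range(1, n + 1))
```

For abba and ababc the 2-spectrum kernel counts ab (1 x 2) and ba (1 x 1), giving 3.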
-
Suffix Trees
Definition
Compact tree built from all the suffixes of a word. The suffix tree of ababc is denoted by S(ababc).
[Figure: suffix tree S(ababc)]
Node label := unique path from the root
Suffix links are used to speed up parsing of strings
Suppose we are at a node ax; then suffix links help us jump to node x
Each internal node has a unique suffix link (McCreight 76)
-
Suffix Trees Contd.
Properties
Represents all the substrings of the given string
Can be constructed in linear time
Offline algorithms - (Weiner 73) and (McCreight 76)
Online algorithm - (Ukkonen 93)
Can be stored using linear space
Edges are encoded by indices of the substring
Each leaf corresponds to a unique suffix
Leaves on subtree give number of occurrences
Each internal node has at least 2 distinct children
Annotation by performing DFS is easy
LCA queries in constant time (linear pre-processing)
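The property "leaves on subtree give number of occurrences" can be illustrated with an uncompressed suffix trie. This is a deliberately naive stand-in for a real suffix tree: it takes quadratic time and space, has no suffix links, and the class and function names are assumptions for this sketch.

```python
class Node:
    """One trie node; `leaves` counts the suffixes that pass through it."""

    def __init__(self):
        self.children = {}
        self.leaves = 0


def build_suffix_trie(y, sentinel="$"):
    """Insert every suffix of y + sentinel, character by character."""
    root = Node()
    text = y + sentinel
    for i in range(len(text)):
        node = root
        node.leaves += 1
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
            node.leaves += 1
    return root


def occurrences(root, w):
    """Number of occurrences of w = number of leaves below the node spelling w."""
    node = root
    for ch in w:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.leaves
```

In the trie for ababc, the node spelling ab has two leaves below it (the suffixes ababc$ and abc$), matching the two occurrences of ab.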
-
Matching Statistics
Definition
Given strings x, y with $|x| = n$ and $|y| = m$, the matching statistics of x with respect to y are defined by $v, c \in \mathbb{N}^n$, where
$v_i$ is the length of the longest substring of y matching a prefix of $x[i : n]$
$\bar{v}_i := i + v_i - 1$
$c_i$ is a pointer to $\mathrm{ceil}(x[i : \bar{v}_i])$ in S(y).
ceil(s) is the last node on the path from the root to s
This can be computed in linear time (Chang and Lawler, 1994).
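The lengths $v_i$ can be computed directly from the definition. The sketch below is a brute-force version using Python's substring test, not the linear-time Chang and Lawler procedure, and it omits the node pointers $c_i$. Indices are 0-based, unlike the 1-based notation on the slide.

```python
def matching_statistics(x, y):
    """v[i] = length of the longest prefix of x[i:] that occurs somewhere in y."""
    v = []
    for i in range(len(x)):
        length = 0
        # Extend the match one character at a time while it still occurs in y.
        while i + length < len(x) and x[i:i + length + 1] in y:
            length += 1
        v.append(length)
    return v


if __name__ == "__main__":
    print(matching_statistics("abba", "ababc"))
```

For x = abba and y = ababc this yields the lengths 2, 1, 2, 1.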
-
Example
Matching statistic of abba with respect to S(ababc).
String      a    b    b    a
v_i         2    1    2    1
ceil(c_i)   ab   b    b    root
[Figure: suffix tree S(ababc)]
-
Matching Substrings
Prefixes
w is a substring of x iff there is an i such that w is a prefix of $x[i : n]$. The number of occurrences of w in x can be calculated by finding all such i.
Substrings
The set of matching substrings of x and y is the set of all prefixes of $x[i : \bar{v}_i]$.
Next Step
If we have a substring w of x, prefixes of w may occur in x with higher frequency. We need an efficient computation scheme.
-
Key Trick - I
Theorem
Let x and y be strings and c and v be the matching statistics of x with respect to y. Assume that
$W(y, t) = \sum_{s \in \mathrm{prefix}(v)} w_{us} - w_u$ where $u = \mathrm{ceil}(t)$ and $t = uv$
can be computed in constant time for any t. Then k(x, y) can be computed in $O(|x| + |y|)$ time as
$k(x, y) = \sum_{i=1}^{|x|} \mathrm{val}(x[i : \bar{v}_i])$
where val(s) indicates the contribution to the kernel due to string s and all its prefixes.
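The decomposition of k(x, y) into one val term per start position can be checked with a brute-force sketch. Here val is evaluated by walking over every prefix of the longest match and counting its occurrences in y directly, so the running time is far from linear; the helper names and the `weight` callback are assumptions for illustration.

```python
def count_occurrences(y, s):
    """Number of (possibly overlapping) occurrences of s in y."""
    return sum(1 for j in range(len(y) - len(s) + 1) if y.startswith(s, j))


def kernel_via_matching_statistics(x, y, weight=lambda s: 1.0):
    """k(x, y) as a sum over start positions i of val(longest match at i)."""
    total = 0.0
    for i in range(len(x)):
        length = 1
        # Every prefix of the longest match starting at i is a matching substring.
        while i + length <= len(x) and x[i:i + length] in y:
            s = x[i:i + length]
            total += weight(s) * count_occurrences(y, s)
            length += 1
    return total
```

For abba against ababc the four start positions contribute 4, 2, 3 and 2, which sums to 11, the same value obtained by enumerating all shared substrings.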
-
Key Trick - II
Observation
All substrings ending on the same edge of S(y) occur the same number of times in string y
For any matching substring t we can write
$\mathrm{val}(t) := \mathrm{lvs}(\mathrm{floor}(t)) \cdot W(y, t) + \mathrm{val}(\mathrm{ceil}(t))$
For each node v we can pre-compute val(v) by a simple DFS on the suffix tree
We can compute in constant time
$\mathrm{val}(x[i : \bar{v}_i]) = \mathrm{lvs}(\mathrm{floor}(x[i : \bar{v}_i])) \cdot W(y, x[i : \bar{v}_i]) + \mathrm{val}(c_i)$
-
Computing W(y, t)
Length-Dependent Weights
Assume that $w_s = w_{|s|}$, then
$W(y, t) = \sum_{j=|\mathrm{ceil}(t)|}^{|t|} w_j - w_{|\mathrm{ceil}(t)|} = \omega_{|t|} - \omega_{|\mathrm{ceil}(t)|}$
where $\omega_j := \sum_{i=1}^{j} w_i$, which can be pre-computed and stored for $j = 1, 2, \ldots, \max(|x|, |y|)$.
K-spectrum Kernel
It is easy to see that
$W(y, t) = 1$ if $|\mathrm{ceil}(t)| < k$ and $|t| \geq k$, and $W(y, t) = 0$ otherwise
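For length-dependent weights the constant-time evaluation of W(y, t) reduces to a difference of two pre-computed prefix sums. A minimal sketch, where `w` is any weight-by-length function and the function names are assumptions:

```python
from itertools import accumulate


def prefix_weight_sums(w, max_len):
    """omega[j] = w(1) + ... + w(j), with omega[0] = 0."""
    return [0.0] + list(accumulate(w(j) for j in range(1, max_len + 1)))


def W_length_dependent(omega, t_len, ceil_len):
    """W(y, t) for |t| = t_len and |ceil(t)| = ceil_len, in O(1) per query."""
    return omega[t_len] - omega[ceil_len]


if __name__ == "__main__":
    # Exponentially decaying weights w_j = 0.5 ** j.
    omega = prefix_weight_sums(lambda j: 0.5 ** j, 4)
    print(W_length_dependent(omega, 3, 1))  # w_2 + w_3

    # K-spectrum weights with k = 2: W is 1 exactly when ceil_len < k <= t_len.
    omega_k = prefix_weight_sums(lambda j: 1 if j == 2 else 0, 4)
    print(W_length_dependent(omega_k, 3, 1), W_length_dependent(omega_k, 3, 2))
```

The k-spectrum case falls out of the same table: the difference of prefix sums is 1 only when the single non-zero weight at length k lies strictly above |ceil(t)| and at or below |t|.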
-
Generic Weights
TFIDF Weights
Assume $w_s = \phi(\mathrm{freq}(s))\, \psi(|s|)$
Strings on the same edge ⇒ same frequency
$W(y, t) = \phi(\mathrm{freq}(t)) \sum_{i=|\mathrm{ceil}(t)|+1}^{|t|} \psi(i)$
To take into account frequency in the entire training set, build a master suffix tree of all strings in the training set
In case weights are completely arbitrary it suffices to annotate all the nodes of this master suffix tree (Vishwanathan and Smola, 2002).
-
The Estimation Problem
Problem
For prediction we need to compute
$f(x) = \sum_i \alpha_i k(x_i, x).$
This depends on the number of SVs
Web search engines and spam filtering
Large number of SVs
Real time prediction is critical
Key Observation
We are repeatedly parsing the SV strings
Pre-processing the SV set can speed up prediction
Idea
We can merge matching weights from all the SVs. All we need is a master suffix tree!
-
Master Suffix Tree
A collection of strings $X = \{x_1, x_2, \ldots, x_m\}$
We define S(X) as a natural union of $S(x_i)$
Define $\bar{X} := x_1 \$_1 x_2 \$_2 \ldots x_m \$_m$
Construct $S(\bar{X})$
Prune away the extra labels on the leaves
The sentinels $\$_i$ induce an extra log penalty in construction time
The algorithm of Amir et al. avoids this penalty
It uses a clever modification of the McCreight algorithm
We can construct the master suffix tree S(X) in $O(\sum_i |x_i|)$ time
-
Estimation Algorithm
A substring s in a Support Vector $x_i$ is counted (i.e. contributes a weight value) $\alpha_i$ times to the kernel
We merge the suffix trees of all the Support Vectors into a master suffix tree
In the master suffix tree we associate weight $\alpha_i$ with each leaf derived from Support Vector $x_i$
For a node v, we define lvs(v) as the sum of weights associated with the subtree rooted at v
The key theorem can now be applied unchanged to compute f(x)
Our algorithm runs in time linear in the size of x and is independent of the size of the Support Vector set!
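The merging idea can be sketched with a plain dictionary in place of the master suffix tree. Each substring of each Support Vector accumulates its coefficient once per occurrence, so prediction never revisits the individual Support Vectors. This is a quadratic-size illustration only, without the linear-time guarantee of the suffix-tree version, and the function names are assumptions.

```python
from collections import defaultdict


def merge_support_vectors(svs, alphas, weight=lambda s: 1.0):
    """Map each substring s to sum_i alpha_i * num_s(x_i) * w_s."""
    merged = defaultdict(float)
    for x_i, alpha_i in zip(svs, alphas):
        for a in range(len(x_i)):
            for b in range(a + 1, len(x_i) + 1):
                s = x_i[a:b]
                merged[s] += alpha_i * weight(s)  # one term per occurrence
    return merged


def predict(merged, x):
    """f(x) = sum_i alpha_i * k(x_i, x), using only the merged table."""
    total = 0.0
    for a in range(len(x)):
        for b in range(a + 1, len(x) + 1):
            s = x[a:b]
            if s not in merged:
                break  # no longer substring starting at a can match either
            total += merged[s]
    return total


if __name__ == "__main__":
    table = merge_support_vectors(["ababc", "abba"], [1.0, -0.5])
    print(predict(table, "abba"))
```

With unit weights, k(ababc, abba) = 11 and k(abba, abba) = 14, so the example evaluates 1.0 * 11 - 0.5 * 14 = 4.0.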
-
Sliding Windows
Motivation
Bioinformatics: Estimate on windows of a long string
Problem
Long sequence x and a window width N are given
Estimate the function value for windows of length N
Observation
Interesting matches ⇒ cross a window boundary
Algorithm
Compute the matching statistics for x
Account for cross boundary matches (at most N)
In practice very few matches cross the boundary
Expected sub-linear time for sliding a window!
-
A Picture Helps
-
Positional Weights
Motivation
Bioinformatics: weigh exons and introns differently
Definition
$k(x, x') := \sum_{s \sqsubseteq x,\, s' \sqsubseteq x'} w_{(s,x)}\, \omega_{(s',x')}\, \delta_{s,s'}$
where $w_{(s,x)}$ and $\omega_{(s',x')}$ are position dependent.
Observation
Each leaf of S(x) corresponds to a unique suffix of x
Algorithm
Assign a different weight to each leaf
As before sum the weights on each subtree
The original algorithm runs unchanged!
-
Tree Kernels
Subset Trees
Set of connected nodes of a tree T
Definition (Collins and Duffy, 2001)
Denote by $T, T'$ trees and by $t \models T$ a subset tree of T, then
$k(T, T') = \sum_{t \models T,\, t' \models T'} w_t \delta_{t,t'}.$
Our Definition (Vishwanathan and Smola, 2002)
In case we count matching subtrees then $t \models T$ denotes that t is a subtree of T and we get
$k(T, T') = \sum_{t \models T,\, t' \models T'} w_t \delta_{t,t'}.$
-
Permutation Invariance
Problem
We want permutation invariance of unordered trees.
Example:
The following two unordered trees are mirror images
[Figure: two unordered trees that are mirror images of each other]
Solution
Sort trees before computing kernel
Maps equivalent trees to a single representative
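Sorting before comparison can be sketched as a recursive canonical form: each subtree is turned into a bracketed string and the children's strings are sorted before concatenation. The tuple representation `(label, [children])` and the use of Python's default string order are assumptions for this sketch; they are not necessarily the ordering rules used in the talk.

```python
def canonical(tree):
    """Return a bracketed string that is invariant to the order of children.

    tree is a pair (label, children), where children is a list of trees.
    """
    label, children = tree
    # Canonicalise each child first, then sort the resulting strings.
    return "[" + label + "".join(sorted(canonical(c) for c in children)) + "]"


if __name__ == "__main__":
    t1 = ("a", [("b", []), ("c", [("d", []), ("e", [])])])
    t2 = ("a", [("c", [("e", []), ("d", [])]), ("b", [])])  # mirror image of t1
    print(canonical(t1))
    print(canonical(t1) == canonical(t2))
```

Both mirror-image trees map to the single representative [a[b][c[d][e]]], after which any string kernel can be applied to the resulting strings.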
-
Sorting Trees - I
Sorting Rules
Assume existence of lexicographic order on labels
Introduce symbols [, ] satisfying [ < ], and that ], [ …
-
Results
-
Summary and Extensions
Reduction from quadratic or cubic to linear prediction and kernel computation time
Kernels on heaps, stacks, bags, etc. trivial
Compact storage of SVs if redundancies abound in SV set. E.g. for anagram and analphabet we need only analphabet and gram
Ideas can also be extended to suffix arrays
Approximate matching and wildcards
Can we do better in case of approximate matches?
Automata and dynamical systems
Do expensive things with string kernel classifiers
More information at http://www.kernel-machines.org/