
Journal of Machine Learning Research 2 (2002) 419-444 Submitted 12/00; Published 2/02

Text Classification using String Kernels

Huma Lodhi [email protected]

Craig Saunders [email protected]

John Shawe-Taylor [email protected]

Nello Cristianini [email protected]

Chris Watkins [email protected]

Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

Editor: Bernhard Schölkopf

Abstract

We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique.

Experimental comparisons of the performance of the kernel compared with a standard word feature space kernel (Joachims, 1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered for comparison with the subsequences kernel with different decay factors. For larger documents and datasets the paper introduces an approximation technique that is shown to deliver good approximations efficiently for large datasets.

Keywords: Kernels and Support Vector Machines, String Subsequence Kernel, Approximating Kernels, Text Classification

1. Introduction

Standard learning systems (like neural networks or decision trees) operate on input data after they have been transformed into feature vectors d_1, . . . , d_n ∈ D living in an m-dimensional space. In such a space, the data points can be separated by a surface, clustered, interpolated or otherwise analysed. The resulting hypothesis will then be applied to test points in the same vector space, in order to make predictions.

There are many cases, however, where the input data cannot readily be described by explicit feature vectors: for example biosequences, images, graphs and text documents. For such datasets, the construction of a feature extraction module can be as complex and expensive as solving the entire problem. This feature extraction process not only requires



extensive domain knowledge, but important information can also be lost during this process. These extracted features play a key role in the effectiveness of a system.

Kernel methods (KMs) (Cristianini and Shawe-Taylor, 2000; Vapnik, 1995) are an effective alternative to explicit feature extraction. Their building block is the kernel function, i.e. a function returning the inner product between the mapped data points in a higher dimensional space. The learning then takes place in the feature space, provided the learning algorithm can be entirely rewritten so that the data points only appear inside dot products with other data points. Several linear algorithms can be formulated in this way, for clustering, classification and regression. The most well known example of a kernel-based system is the Support Vector Machine (SVM) (Boser et al., 1992; Cristianini and Shawe-Taylor, 2000), but the Perceptron, PCA, Nearest Neighbour, and many other algorithms also have this property. The non-dependence of KMs on the dimensionality of the feature space and the flexibility of using any kernel function make them a good choice for different classification tasks, especially for text classification.

In this paper, we will exploit the important fact that kernel functions can be defined over general sets (Schölkopf, 1997; Watkins, 2000; Haussler, 1999), by assigning to each pair of elements (strings, graphs, images) an ‘inner product’ in a feature space. For such kernels, it is not necessary to invoke Mercer's theorem, as they can be directly shown to be inner products. We examine the use of a kernel method based on string alignment for text categorization problems. By defining inner products between text documents, one can use any of the general purpose algorithms from this rich class. So text can be clustered, classified, ranked, etc. This paper builds on preliminary results presented in Lodhi et al. (2001).

A standard approach (Joachims, 1998) to text categorization makes use of the classical text representation technique (Salton et al., 1975) that maps a document to a high dimensional feature vector, where each entry of the vector represents the presence or absence of a feature. This approach loses all the word order information, only retaining the frequency of the terms in the document. This is usually accompanied by the removal of non-informative words (stop words) and by the replacing of words by their stems, so losing inflection information. Such sparse vectors can then be used in conjunction with many learning algorithms. This simple technique has recently been used very successfully in supervised learning tasks with Support Vector Machines (Joachims, 1998).

In this paper we propose a radically different approach, that considers documents simply as symbol sequences, and makes use of specific kernels. The approach does not use any domain knowledge, in the sense that it considers the document just as a long sequence, and nevertheless is capable of capturing topic information. The feature space in this case is generated by the set of all (non-contiguous) substrings of k symbols, as described in detail in Section 3. The more substrings two documents have in common, the more similar they are considered (the higher their inner product).

We build on recent advances (Watkins, 2000; Haussler, 1999) that demonstrate how to build kernels over general structures like sequences. The most remarkable property of such methods is that they map documents to feature vectors without explicitly representing them, by means of sequence alignment techniques. A dynamic programming technique makes the computation of the kernels very efficient (linear in the length of the documents).


We empirically analyse this approach and present experimental results on a set of documents containing stories from Reuters news agency, the Reuters dataset. We compare the proposed approach to the classical text representation technique (also known as the bag-of-words) and the n-grams based text representation technique, demonstrating that the approach delivers state-of-the-art performance in categorization, and can outperform the bag-of-words approach.

The experimental analysis of this technique showed that it suffers from practical limitations for large text corpora. This establishes a need to develop an approximation strategy. Furthermore, for text categorization tasks there is now a large variety of problems for which the datasets are huge. It is therefore important to find methods which can efficiently compute the Gram matrix.

One way to reduce computation time would be to provide a method which quickly computes an approximation, instead of evaluating the full kernel. Provided the approximation of the Gram matrix can be shown not to deviate significantly from that produced by the full string kernel, various kernel methods can then be applied to large text-based datasets. In this paper we show how one can successfully approximate the Gram matrix by considering only a subset of the features which are generated by the string kernel. We use the recently proposed alignment measure (Cristianini et al., to appear) to show the deviation from the true Gram matrix. Remarkably few features are needed in order to approximate the full matrix, and therefore computation time is greatly reduced (by several orders of magnitude). In order to show the effectiveness of this method, we conduct an experiment which uses the string subsequence kernel (SSK) on the full Reuters dataset.

2. Kernels and Support Vector Machines

This section reviews the main ideas behind Support Vector Machines (SVMs) and kernel functions. SVMs are a class of algorithms that combine the principles of statistical learning theory with optimisation techniques and the idea of a kernel mapping. They were introduced by Boser et al. (1992), and in their simplest version they learn a separating hyperplane between two sets of points so as to maximise the margin (distance between plane and closest point). This solution has several interesting statistical properties, that make it a good candidate for valid generalisation. One of the main statistical properties of the maximal margin solution is that its performance does not depend on the dimensionality of the space where the separation takes place. In this way, it is possible to work in very high dimensional spaces, such as those induced by kernels, without overfitting.

In the classification case, SVMs work by mapping the data points into a high dimensional feature space, where a linear learning machine is used to find a maximal margin separation. In the case of kernels defined over a space, this hyperplane in the feature space can correspond to a nonlinear decision boundary in the input space. In the case of kernels defined over sets, this hyperplane simply corresponds to a dichotomy of the input set.

We now briefly describe kernel functions. A kernel function calculates the inner product between mapped examples in a feature space: for a mapping φ : D → F, K(d_i, d_j) = ⟨φ(d_i), φ(d_j)⟩ is a kernel function. Note that the kernel computes this inner product by implicitly mapping the examples to the feature space. The mapping


φ transforms an n-dimensional example into an N-dimensional feature vector:

φ(d) = (φ_1(d), . . . , φ_N(d)) = (φ_i(d)) for i = 1, . . . , N.

The explicit extraction of features in a feature space generally has very high computational cost, but a kernel function provides a way to handle this problem. The mathematical foundation of such a function was established during the first decade of the twentieth century (Mercer, 1909). A kernel function is a symmetric function,

K(d_i, d_j) = K(d_j, d_i), for i, j = 1, . . . , n.

The n × n matrix with entries of the form K_ij = K(d_i, d_j) is known as the kernel matrix. A kernel matrix is a symmetric, positive semi-definite matrix. It is interesting to note that this matrix is the main source of information for KMs and these methods use only this information to learn a classifier. There are ways of combining simple kernels to obtain more complex ones.

For example, given a kernel K and a set of n vectors, the polynomial construction is given by

K_poly(d_i, d_j) = (K(d_i, d_j) + c)^p,

where p is a positive integer and c is a nonnegative constant. Clearly, we incur only a small computational cost to define a new feature space. The feature space corresponding to a degree p polynomial kernel includes all products of at most p input features. Hence polynomial kernels create images of the examples in feature spaces having huge numbers of dimensions.

Furthermore, Gaussian kernels define a feature space with an infinite number of dimensions; the kernel is given by

K_gauss(d_i, d_j) = exp(−‖d_i − d_j‖² / (2σ²)).

A Gaussian kernel allows an algorithm to learn a linear classifier in an infinite dimensional feature space.
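As an illustration only (not from the paper), the following Python sketch builds these two constructions from explicit feature vectors; all function and variable names are ours.

    import numpy as np

    def polynomial_kernel(K, c=1.0, p=2):
        # K_poly(d_i, d_j) = (K(d_i, d_j) + c)^p, applied entrywise to a base Gram matrix K
        return (K + c) ** p

    def gaussian_kernel(X, sigma=1.0):
        # K_gauss(d_i, d_j) = exp(-||d_i - d_j||^2 / (2 sigma^2)) for the rows of X
        sq = np.sum(X ** 2, axis=1)
        dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
        return np.exp(-dist2 / (2.0 * sigma ** 2))

    # Example: a linear base kernel on random 5-dimensional examples
    X = np.random.rand(4, 5)
    K_lin = X @ X.T
    print(polynomial_kernel(K_lin, c=1.0, p=2))
    print(gaussian_kernel(X, sigma=0.5))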

3. A Kernel for Text Sequences - A Step beyond Words

In this section we describe a kernel between two text documents. The idea is to compare them by means of the substrings they contain: the more substrings in common, the more similar they are. An important part is that such substrings do not need to be contiguous, and the degree of contiguity of one such substring in a document determines how much weight it will have in the comparison.

For example: the substring ‘c-a-r’ is present both in the word ‘card’ and in the word ‘custard’, but with different weighting. For each such substring there is a dimension of the feature space, and the value of that coordinate depends on how frequently and how compactly the string is embedded in the text. In order to deal with non-contiguous substrings, it is necessary to introduce a decay factor λ ∈ (0, 1) that can be used to weight the presence of a certain feature in a text (see Definition 1 for more details).
Example. Consider - as simple documents - the words cat, car, bat, bar. If we consider only k = 2, we obtain an 8-dimensional feature space, where the words are mapped as follows:


         c-a    c-t    a-t    b-a    b-t    c-r    a-r    b-r
φ(cat)   λ^2    λ^3    λ^2    0      0      0      0      0
φ(car)   λ^2    0      0      0      0      λ^3    λ^2    0
φ(bat)   0      0      λ^2    λ^2    λ^3    0      0      0
φ(bar)   0      0      0      λ^2    0      0      λ^2    λ^3

Hence, the unnormalised kernel between car and cat is K(car, cat) = λ^4, whereas the normalised version is obtained as follows: K(car, car) = K(cat, cat) = 2λ^4 + λ^6 and hence the normalised K(car, cat) = λ^4/(2λ^4 + λ^6) = 1/(2 + λ^2). Note that in general a document will contain more than one word, but the mapping for the whole document is into one feature space: the concatenation of all the words and the spaces (ignoring the punctuation) is considered as a unique sequence.
Example. We can compute the similarity between the two parts of a famous line by Kant:

K(“science is organized knowledge”, “wisdom is organized life”)

The values of this kernel for k = 1, 2, 3, 4, 5, 6 are: K_1 = 0.580, K_2 = 0.580, K_3 = 0.478, K_4 = 0.439, K_5 = 0.406, K_6 = 0.370.
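For very short strings such as those in the cat/car example above, the gap-weighted feature map can be computed explicitly by brute force. The following Python sketch (ours, not from the paper; all names are illustrative) enumerates all length-k subsequences and reproduces K(car, cat) = λ^4 and the normalised score 1/(2 + λ^2):

    from itertools import combinations
    from math import sqrt

    def subseq_features(s, k, lam):
        # Explicit feature map: every length-k subsequence u of s, weighted by lam^(span of its indices)
        phi = {}
        for idx in combinations(range(len(s)), k):        # index tuples i1 < ... < ik
            u = "".join(s[i] for i in idx)
            span = idx[-1] - idx[0] + 1                    # l(i): full length of the occurrence in s
            phi[u] = phi.get(u, 0.0) + lam ** span
        return phi

    def ssk_naive(s, t, k, lam):
        ps, pt = subseq_features(s, k, lam), subseq_features(t, k, lam)
        return sum(w * pt.get(u, 0.0) for u, w in ps.items())

    def ssk_normalised(s, t, k, lam):
        return ssk_naive(s, t, k, lam) / sqrt(ssk_naive(s, s, k, lam) * ssk_naive(t, t, k, lam))

    lam = 0.5
    print(ssk_naive("car", "cat", 2, lam))       # lam**4 = 0.0625
    print(ssk_normalised("car", "cat", 2, lam))  # 1 / (2 + lam**2) ≈ 0.444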

However, for interesting substring sizes (e.g. k > 4) and normal sized documents, direct computation of all the relevant features would be impractical (even for moderately sized texts) and hence explicit use of such a representation would be impossible. But it turns out that a kernel using such features can be defined and calculated in a very efficient way by using dynamic programming techniques. We derive the kernel by starting from the features and working out their inner product. In this case there is no need to prove that it satisfies Mercer's conditions (symmetry and positive semi-definiteness) since they follow automatically from its definition as an inner product. This kernel, named the string subsequence kernel (SSK), is based on work (Watkins, 2000; Haussler, 1999) mostly motivated by bioinformatics applications. It maps strings to a feature vector indexed by all k-tuples of characters. A k-tuple will have a non-zero entry if it occurs as a subsequence anywhere (not necessarily contiguously) in the string. The weighting of the feature will be the sum over the occurrences of the k-tuple of a decaying factor of the length of the occurrence.

Definition 1 (String subsequence kernel - SSK) Let Σ be a finite alphabet. A string is a finite sequence of characters from Σ, including the empty sequence. For strings s, t, we denote by |s| the length of the string s = s_1 . . . s_|s|, and by st the string obtained by concatenating the strings s and t. The string s[i : j] is the substring s_i . . . s_j of s. We say that u is a subsequence of s, if there exist indices i = (i_1, . . . , i_|u|), with 1 ≤ i_1 < · · · < i_|u| ≤ |s|, such that u_j = s_{i_j}, for j = 1, . . . , |u|, or u = s[i] for short. The length l(i) of the subsequence in s is i_|u| − i_1 + 1. We denote by Σ^n the set of all finite strings of length n, and by Σ^* the set of all strings

Σ^* = ∪_{n=0}^{∞} Σ^n.    (1)

We now define feature spaces F_n = R^{Σ^n}. The feature mapping φ for a string s is given by defining the u coordinate φ_u(s) for each u ∈ Σ^n. We define

φ_u(s) = Σ_{i: u = s[i]} λ^{l(i)},    (2)


for some λ ≤ 1. These features measure the number of occurrences of subsequences in the string s, weighting them according to their lengths. Hence, the inner product of the feature vectors for two strings s and t gives a sum over all common subsequences weighted according to their frequency of occurrence and lengths

K_n(s, t) = Σ_{u ∈ Σ^n} ⟨φ_u(s) · φ_u(t)⟩ = Σ_{u ∈ Σ^n} Σ_{i: u = s[i]} λ^{l(i)} Σ_{j: u = t[j]} λ^{l(j)}
          = Σ_{u ∈ Σ^n} Σ_{i: u = s[i]} Σ_{j: u = t[j]} λ^{l(i)+l(j)}.

A direct computation of these features would involve O(|Σ|^n) time and space, since this is the number of features involved. It is also clear that most of the features will have non-zero components for large documents. In order to derive an effective procedure for computing such a kernel, we introduce an additional function which will aid in defining a recursive computation for this kernel. Let

K'_i(s, t) = Σ_{u ∈ Σ^i} Σ_{i: u = s[i]} Σ_{j: u = t[j]} λ^{|s|+|t|−i_1−j_1+2},   i = 1, . . . , n − 1,

that is, counting the length from the beginning of the particular sequence through to the end of the strings s and t instead of just l(i) and l(j). We can now define a recursive computation for K'_i and hence compute K_n.

Definition 2 Recursive computation of the subsequence kernel.

K'_0(s, t) = 1, for all s, t,
K'_i(s, t) = 0, if min(|s|, |t|) < i,
K_i(s, t) = 0, if min(|s|, |t|) < i,
K'_i(sx, t) = λ K'_i(s, t) + Σ_{j: t_j = x} K'_{i−1}(s, t[1 : j − 1]) λ^{|t|−j+2},   i = 1, . . . , n − 1,
K_n(sx, t) = K_n(s, t) + Σ_{j: t_j = x} K'_{n−1}(s, t[1 : j − 1]) λ^2.

Notice that we need the auxiliary function K' since it is only the interior gaps in the subsequences that are penalised. The correctness of this recursion follows from observing how the length of the strings has increased, incurring a factor of λ for each extra length unit. Hence, in the formula for K'_i(sx, t), the first term has one fewer character, so requiring a single λ factor, while the second has |t| − j + 2 fewer characters. For the last formula the second term requires the addition of just two characters, one to s and one to t[1 : j − 1], since x is the last character of the n-sequence. If we wished to compute K_n(s, t) for a range of values of n, we would simply perform the computation of K'_i(s, t) up to one less than the largest n required, and then apply the last recursion for each K_n(s, t) that is needed using the stored values of K'_i(s, t). We can of course create a kernel K(s, t) that combines the different K_n(s, t), giving different (positive) weightings for each n.
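As an illustration, a direct (memoised but otherwise naive) Python transcription of the recursion in Definition 2 follows; it is a sketch of ours rather than the authors' implementation, and all names are hypothetical. On the toy example it reproduces K_2(car, cat) = λ^4.

    from functools import lru_cache

    def ssk_recursive(s, t, n, lam=0.5):
        """String subsequence kernel K_n(s, t) computed directly from Definition 2."""

        @lru_cache(maxsize=None)
        def k_prime(i, s, t):
            if i == 0:
                return 1.0                                   # K'_0(s, t) = 1
            if min(len(s), len(t)) < i:
                return 0.0                                   # K'_i(s, t) = 0 if a string is too short
            x, head = s[-1], s[:-1]                          # s = head + x
            total = lam * k_prime(i, head, t)
            for j, tj in enumerate(t, start=1):              # j is 1-based, as in the paper
                if tj == x:
                    total += k_prime(i - 1, head, t[:j - 1]) * lam ** (len(t) - j + 2)
            return total

        def k_n(s, t):
            if min(len(s), len(t)) < n:
                return 0.0                                   # K_n(s, t) = 0 if a string is too short
            x, head = s[-1], s[:-1]
            total = k_n(head, t)
            for j, tj in enumerate(t, start=1):
                if tj == x:
                    total += k_prime(n - 1, head, t[:j - 1]) * lam ** 2
            return total

        return k_n(s, t)

    print(ssk_recursive("car", "cat", 2, 0.5))   # 0.5**4 = 0.0625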


Once we have created such a kernel it is natural to normalise it to remove any bias introduced by document length. We can produce this effect by normalising the feature vectors in the feature space. Hence, we create a new embedding φ̂(s) = φ(s)/‖φ(s)‖, which gives rise to the kernel

K̂(s, t) = ⟨φ̂(s) · φ̂(t)⟩ = ⟨ φ(s)/‖φ(s)‖ · φ(t)/‖φ(t)‖ ⟩ = (1/(‖φ(s)‖ ‖φ(t)‖)) ⟨φ(s) · φ(t)⟩ = K(s, t) / √(K(s, s) K(t, t)).
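Since the normalisation needs only kernel values, it can be wrapped around any kernel function; a one-line sketch (ours):

    from math import sqrt

    def normalised(kernel, s, t):
        # K_hat(s, t) = K(s, t) / sqrt(K(s, s) * K(t, t)); removes document-length bias
        return kernel(s, t) / sqrt(kernel(s, s) * kernel(t, t))

    # e.g. wrapping the ssk_recursive sketch above with k = 2 and lambda = 0.5:
    # normalised(lambda a, b: ssk_recursive(a, b, 2, 0.5), "car", "cat")  ->  1 / (2 + 0.25) ≈ 0.444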

Efficient Computation of SSK

SSK measures the similarity between documents s and t in time proportional to n|s||t|^2, where n is the length of the subsequences. This is evident from the description of the recursion in Definition 2, as the outermost recursion is over the sequence length and, for each length and each additional character in s and t, a sum over the sequence t must be evaluated. However it is possible to speed up the computation of SSK. We now present an efficient recursive computation of SSK that reduces the complexity of the computation to O(n|s||t|), by first evaluating

K''_i(sx, t) = Σ_{j: t_j = x} K'_{i−1}(s, t[1 : j − 1]) λ^{|t|−j+2}

and observing that we can then evaluate K'_i(s, t) with the O(|s||t|) recursion,

K'_i(sx, t) = λ K'_i(s, t) + K''_i(sx, t).

Now observe that K''_i(sx, tu) = λ^{|u|} K''_i(sx, t), provided x does not occur in u, while

K''_i(sx, tx) = λ (K''_i(sx, t) + λ K'_{i−1}(s, t)).

These observations together give an O(|s||t|) recursion for computing K''_i(s, t). Hence, we can evaluate the overall kernel in O(n|s||t|) time.
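A dynamic-programming version of this O(n|s||t|) scheme can be sketched as follows (our illustration, with hypothetical names; Kp[i, p, q] stores K'_i over the prefixes s[:p], t[:q] and Kpp plays the role of K''_i):

    import numpy as np

    def ssk_dp(s, t, n, lam=0.5):
        """O(n|s||t|) string subsequence kernel using the K'' recursion."""
        ls, lt = len(s), len(t)
        if min(ls, lt) < n:
            return 0.0
        Kp = np.zeros((n, ls + 1, lt + 1))
        Kp[0, :, :] = 1.0                             # K'_0 = 1 for all prefixes
        for i in range(1, n):
            Kpp = np.zeros((ls + 1, lt + 1))          # K''_i(s[:p], t[:q])
            for p in range(i, ls + 1):
                for q in range(i, lt + 1):
                    if s[p - 1] == t[q - 1]:
                        # K''_i(sx, tx) = lam * (K''_i(sx, t) + lam * K'_{i-1}(s, t))
                        Kpp[p, q] = lam * (Kpp[p, q - 1] + lam * Kp[i - 1, p - 1, q - 1])
                    else:
                        # K''_i(sx, tu) = lam^{|u|} K''_i(sx, t), one character of u at a time
                        Kpp[p, q] = lam * Kpp[p, q - 1]
                    # K'_i(sx, t) = lam * K'_i(s, t) + K''_i(sx, t)
                    Kp[i, p, q] = lam * Kp[i, p - 1, q] + Kpp[p, q]
        # K_n(sx, t) = K_n(s, t) + lam^2 * sum_{j: t_j = x} K'_{n-1}(s, t[1:j-1])
        k = 0.0
        for p in range(n, ls + 1):
            for q in range(1, lt + 1):
                if s[p - 1] == t[q - 1]:
                    k += lam * lam * Kp[n - 1, p - 1, q - 1]
        return k

    print(ssk_dp("car", "cat", 2, 0.5))   # 0.0625, agreeing with the direct recursion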

n-grams - A Language-Independent Approach

The n-gram representation is a language independent text representation technique. It transforms documents into high dimensional feature vectors where each feature corresponds to a contiguous substring. n-grams are n adjacent characters (substrings) from the alphabet A. Hence, the number of distinct n-grams in a text is less than or equal to |A|^n. This shows that the dimensionality of the n-grams feature vector can be very high even for moderate values of n. However, not all of these n-grams are present in any given document, which reduces the dimensionality substantially. For example there are 8727 unique tri-grams (excluding stop words) in the Reuters dataset. Generally, during n-grams feature vector formation all the upper-case characters are converted into lower-case characters and a space is substituted for punctuation. The feature vectors are then normalised. This is illustrated in the following example.
Example. Consider computing the tri-gram and quad-gram feature vectors of d = “support vector”.
The 3-grams are sup upp ppo por ort rt_ t_v _ve vec ect cto tor, while the 4-grams are


supp uppo ppor port ort_ rt_v t_ve _vec vect ecto ctor, where _ represents a space. Systems based on this technique have been applied in situations where the text suffers from errors such as misspelling (Cavnar, 1994; Huffman, 1995). The choice of an optimal n varies with the text corpus.
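A minimal sketch (ours) of this extraction, lower-casing the text and mapping punctuation to spaces, reproduces the example:

    def char_ngrams(text, n):
        # lower-case, replace punctuation by a space, then slide a window of width n
        cleaned = "".join(ch.lower() if ch.isalnum() or ch == " " else " " for ch in text)
        return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]

    print(char_ngrams("support vector", 3))
    # ['sup', 'upp', 'ppo', 'por', 'ort', 'rt ', 't v', ' ve', 'vec', 'ect', 'cto', 'tor']
    print(char_ngrams("support vector", 4))
    # ['supp', 'uppo', 'ppor', 'port', 'ort ', 'rt v', 't ve', ' vec', 'vect', 'ecto', 'ctor']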

Efficient Implementation

Since even with the speed up described above the computation of SSK is not cheap, more efficient techniques were needed. Our goal of evaluating the performance of SSK in conjunction with an SVM on different splits of the data also required some special properties in the software. We used a simple gradient based implementation of SVMs (Friess et al., 1998; Cristianini and Shawe-Taylor, 2000). The key to the success of our system is a form of chunking. We start with a very small subset of the data and gradually build up the size of the training set, while ensuring that only points which failed to meet margin 1 on the current hypothesis were included in the next chunk.

Since each evaluation of the kernel function requires significant computational resources, we designed the system to only calculate those entries of the kernel matrix that are actually required by the training algorithm. This can significantly reduce the training time, since only a relatively small part of the kernel matrix is actually used by our implementation of the SVM. The number of kernel evaluations is approximately equal to the size of the sample times the number of support vectors.

Once computed, kernel entries were all saved for reuse on different splits of the data. This property makes it possible to train a classifier for a number of splits of the data without incurring significant additional computational cost, provided there is overlap in the support vectors for each split. The key idea is to save all the kernel entries evaluated during the training and test phases and use the kernel matrix with the computed entries to evaluate SSK on a new split of the same data or to learn a different category of the same data.
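The kernel-entry caching described here can be sketched as a thin wrapper around any kernel function (our illustration, not the authors' code; ssk_dp refers to the earlier sketch):

    class CachedKernel:
        """Memoise kernel evaluations keyed by document ids so they can be reused across splits."""

        def __init__(self, kernel):
            self.kernel = kernel
            self.cache = {}

        def __call__(self, id_i, id_j, doc_i, doc_j):
            key = (id_i, id_j) if id_i <= id_j else (id_j, id_i)   # symmetry: K(i, j) = K(j, i)
            if key not in self.cache:
                self.cache[key] = self.kernel(doc_i, doc_j)
            return self.cache[key]

    # e.g. cached = CachedKernel(lambda a, b: ssk_dp(a, b, n=5, lam=0.5));
    # entries computed for one split or category are then served from the cache on the next one.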

4. Experimental Results

In this section we describe the experiments; the emphasis is on understanding how SSK works in practice. The objectives of the experiments are to

• observe the influence of varying the tunable parameters k (length) and λ (weight) on performance,

• assess the advantages of combining different kernels.

In order to accomplish these goals we conducted a series of experiments on a subset of documents from the Reuters dataset.

Reuters Dataset
The Reuters dataset contains stories from the Reuters news agency. We used Reuters-21578, the newer version of the corpus. It was compiled by David Lewis in 1987 and is publicly available at http://www.research.att.com/lewis. To obtain a training set and a test set there exist different splits of the corpus. We used the Modified Apte (“ModApte”) split. The “ModApte” split comprises 9603 training and 3299 test documents. A Reuters category can contain as few as 1 or as many as 2877 documents in the training set. Similarly a test


set category can have as few as 1 or as many as 1066 relevant documents. The experiments described in Section 7 were conducted on the full Reuters dataset using the ModApte split.

As mentioned above, the experiments presented in this section were performed on a subset of the Reuters dataset. We set the size of the subset so that the computation of SSK was no longer a concern. The size of the subset of Reuters was set to 470 documents, using 380 documents for training the classifier and evaluating the performance of the learned classifier on a test set of 90 documents. The next step was to choose the categories. “Earn” and “acquisition” are the most frequent categories of the Reuters dataset. The direct correspondence between the respective words and the categories “crude” and “corn” makes them potential candidates. The splits of the data had the following sizes and numbers of positive examples in training and test sets: numbers of positive examples in training (testing) set out of 370 (90): earn 152 (40); acquisition 114 (25); crude 76 (15); corn 38 (10).

We now describe the preprocessing stage for SSK. We removed punctuation and the words that occur in a stop list, keeping spaces in their original places in the documents.

The performance of SSK was compared to the performance of the standard word kernel (WK) and the n-grams kernel (NGK), where WK is a linear kernel that measures the similarity between documents that are indexed by words with a tfidf weighting scheme. Similarly NGK is also a linear kernel that returns a similarity score between documents that are indexed by n-grams.

In order to learn an SVM classifier in conjunction with WK we preprocessed the documents as described. Stop words and punctuation were removed from the documents. We weighted the entries of the feature vectors by using a variant of the tfidf weighting scheme, log(1 + tf) * log(n/df). Here tf represents term frequency while df is used for document frequency and n is the total number of documents. The documents are normalised so that each document has equal length.
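A small sketch (ours) of this weighting and the resulting linear word kernel; the tokenisation, df counts and corpus size in the example are invented for illustration:

    import math
    from collections import Counter

    def wk_vector(doc_tokens, df, n_docs):
        # variant tfidf weight log(1 + tf) * log(n / df), followed by length (L2) normalisation
        tf = Counter(doc_tokens)
        vec = {w: math.log(1 + f) * math.log(n_docs / df[w]) for w, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {w: v / norm for w, v in vec.items()} if norm > 0 else vec

    def wk_kernel(vec_a, vec_b):
        # WK: linear kernel (dot product) between two normalised word vectors
        return sum(v * vec_b.get(w, 0.0) for w, v in vec_a.items())

    df = {"oil": 40, "price": 60, "crude": 25}          # hypothetical document frequencies
    a = wk_vector(["oil", "price", "oil"], df, n_docs=100)
    b = wk_vector(["crude", "oil"], df, n_docs=100)
    print(wk_kernel(a, b))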

We now describe the preprocessing stage for n-grams feature vectors. For consistency, we removed stop words and punctuation. Each document in the collection is transformed into a feature vector, where each entry of the feature vector represents the number of times the corresponding substring occurs in the document. Note that the feature vectors are normalised.

For evaluation we used the F1 performance measure, given by 2pr/(p + r), where p is the precision and r is the recall. Note that the F1 measure gives equal weighting to both precision and recall. The parameter C was tuned by conducting very preliminary experiments on one split of the data for one category. Note that the value of C was set using the standard WK and the chosen value was used for all the kernels and for all the categories.

4.1 Effectiveness of Varying Sequence Length

The effectiveness of a text-categorization system based on SSK can be controlled by the free parameters, the “length of a subsequence k” and the “weight decay parameter λ”. In order to understand the role of SSK for text categorization, it is important to study the performance of a classifier such as an SVM in conjunction with SSK by varying k and λ. Note that for each new value of these parameters, we obtain a new kernel and in turn the resultant kernel matrix contains new information. We studied the effect of varying parameters by


Category  Kernel  Length  F1            Precision     Recall
                          Mean   SD     Mean   SD     Mean   SD
earn      SSK     3       0.925  0.036  0.981  0.030  0.878  0.057
                  4       0.932  0.029  0.992  0.013  0.888  0.052
                  5       0.936  0.036  0.992  0.013  0.888  0.067
                  6       0.936  0.033  0.992  0.013  0.888  0.060
                  7       0.940  0.035  0.992  0.013  0.900  0.064
                  8       0.934  0.033  0.992  0.010  0.885  0.058
                  10      0.927  0.032  0.997  0.009  0.868  0.054
                  12      0.931  0.036  0.981  0.025  0.888  0.058
                  14      0.936  0.027  0.959  0.033  0.915  0.041
          NGK     3       0.919  0.035  0.974  0.036  0.873  0.062
                  4       0.943  0.030  0.992  0.013  0.900  0.055
                  5       0.944  0.026  0.992  0.013  0.903  0.051
                  6       0.943  0.030  0.992  0.013  0.900  0.055
                  7       0.940  0.035  0.992  0.013  0.895  0.064
                  8       0.940  0.045  0.992  0.013  0.895  0.063
                  10      0.932  0.032  0.990  0.015  0.885  0.053
                  12      0.917  0.033  0.975  0.024  0.868  0.053
                  14      0.923  0.034  0.973  0.033  0.880  0.055
          WK      -       0.925  0.033  0.989  0.014  0.867  0.057
acq       SSK     3       0.785  0.040  0.863  0.060  0.724  0.064
                  4       0.822  0.047  0.898  0.045  0.760  0.068
                  5       0.867  0.038  0.914  0.042  0.828  0.057
                  6       0.876  0.051  0.934  0.043  0.828  0.080
                  7       0.864  0.045  0.920  0.046  0.816  0.063
                  8       0.852  0.049  0.918  0.051  0.796  0.064
                  9       0.820  0.056  0.903  0.053  0.756  0.089
                  10      0.791  0.067  0.848  0.072  0.744  0.083
                  12      0.791  0.067  0.848  0.072  0.744  0.083
                  14      0.774  0.042  0.819  0.067  0.736  0.043
          NGK     3       0.791  0.043  0.842  0.061  0.748  0.053
                  4       0.873  0.031  0.896  0.037  0.852  0.038
                  5       0.882  0.038  0.912  0.041  0.856  0.051
                  6       0.880  0.045  0.923  0.041  0.844  0.072
                  7       0.870  0.050  0.904  0.047  0.844  0.085
                  8       0.857  0.044  0.897  0.039  0.824  0.071
                  10      0.830  0.045  0.887  0.063  0.784  0.071
                  12      0.806  0.066  0.850  0.062  0.768  0.079
                  14      0.776  0.060  0.814  0.061  0.744  0.076
          WK      -       0.802  0.072  0.843  0.067  0.768  0.090

Table 1: The performance (F1, precision, recall) of SVM with SSK, NGK and WK for Reuters categories earn and acq. Results illustrate the effect of the variability of subsequence length on performance. The results are averaged over 10 runs of the techniques. We also report standard deviation.


Category  Kernel  Length  F1            Precision     Recall
                          Mean   SD     Mean   SD     Mean   SD
crude     SSK     3       0.881  0.077  0.931  0.101  0.853  0.129
                  4       0.905  0.090  0.980  0.032  0.853  0.143
                  5       0.936  0.045  0.979  0.033  0.900  0.078
                  6       0.901  0.051  0.990  0.031  0.834  0.079
                  7       0.872  0.050  0.963  0.052  0.800  0.078
                  8       0.828  0.066  0.935  0.062  0.747  0.088
                  10      0.764  0.098  0.919  0.095  0.660  0.111
                  12      0.709  0.111  0.901  0.095  0.593  0.127
                  14      0.761  0.106  0.897  0.066  0.680  0.146
          NGK     3       0.907  0.060  0.993  0.021  0.840  0.100
                  4       0.935  0.041  0.961  0.053  0.913  0.063
                  5       0.937  0.048  0.968  0.045  0.913  0.083
                  6       0.908  0.041  0.958  0.037  0.867  0.070
                  7       0.904  0.054  0.957  0.048  0.860  0.080
                  8       0.869  0.060  0.921  0.062  0.827  0.090
                  10      0.811  0.090  0.903  0.083  0.740  0.111
                  12      0.737  0.098  0.870  0.130  0.647  0.104
                  14      0.884  0.171  0.944  0.094  0.847  0.222
          WK      -       0.904  0.043  0.910  0.082  0.907  0.064
corn      SSK     3       0.665  0.169  0.940  0.077  0.540  0.190
                  4       0.783  0.103  0.924  0.086  0.690  0.137
                  5       0.779  0.104  0.886  0.094  0.700  0.125
                  6       0.749  0.096  0.919  0.098  0.640  0.117
                  7       0.643  0.107  0.897  0.095  0.510  0.120
                  8       0.569  0.099  0.893  0.097  0.430  0.116
                  10      0.582  0.107  0.912  0.097  0.440  0.126
                  12      0.618  0.086  0.883  0.114  0.490  0.110
                  14      0.702  0.114  0.860  0.123  0.610  0.152
          NGK     3       0.797  0.068  0.911  0.081  0.720  0.114
                  4       0.841  0.071  0.904  0.107  0.800  0.105
                  5       0.847  0.103  0.912  0.092  0.800  0.141
                  6       0.815  0.089  0.939  0.060  0.730  0.134
                  7       0.767  0.117  0.953  0.078  0.650  0.143
                  8       0.706  0.125  0.912  0.094  0.590  0.160
                  10      0.646  0.113  0.890  0.970  0.520  0.132
                  12      0.675  0.131  0.931  0.092  0.540  0.143
                  14      0.813  0.174  0.933  0.125  0.740  0.232
          WK      -       0.762  0.099  0.833  0.065  0.710  0.137

Table 2: The performance (F1, precision, recall) of SVM with SSK, NGK and WK for Reuters categories crude and corn. Results illustrate the effect of the variability of subsequence length on performance. The results are averaged over 10 runs of the techniques. We also report standard deviation.


adopting the experimental methodology as described. For the first set of experiments we kept the value of one parameter, λ, fixed and learned a classifier for different values of k. We conducted another set of experiments to observe how the performance is affected by varying the parameter λ. Finally we empirically studied the advantages of combining different kernels. This section describes the first set of experiments.

For these experiments the value of the weight decay parameter was set to 0.5, and the sequence length was varied. SSK was compared to NGK, where the n-grams length was also varied over a range of values. The effectiveness of SSK was also compared with the effectiveness of WK. Tables 1 and 2 describe the results of these experiments, where precision, recall and F1 numbers are shown for all three kernels. Note that these results are averaged over 10 runs of the algorithm.

From these results, we find that the performance of the classifier varies with respect to the sequence length. SSK can be more effective for small or moderate substrings as compared to larger substrings. As the results show, an optimal sequence length can be found in a region that is not very large. For each category the F1 numbers (with respect to SSK) seem to peak at a sequence length between 4 and 7. It seems that short or moderate non-contiguous substrings are able to capture the semantics better than longer non-contiguous substrings. In practice, the sequence length can be set by a validation set for each category.

Tables 1 and 2 also present the results of NGK and WK. We first focus on the performance of an SVM classifier with NGK and compare the role of NGK with SSK for text categorization. It is interesting to note that the generalisation performance of both techniques is comparable, where NGK works on contiguous substrings and SSK works on non-contiguous substrings. The results show that the generalisation performance of an SVM classifier in conjunction with NGK is higher for short substrings, and for longer substrings the performance of NGK is worse.

The classical text representation technique (WK) was also compared with SSK. It is worth noting that the performance of SSK is better than WK for each category. The size of the dataset can be one factor responsible for the degradation in the performance of WK, but these results show that SSK is an effective technique.

4.2 Effectiveness of Varying Weight Decay Factor

In this set of experiments, we analyse the effect of varying λ on the generalisation performance of an SVM learner that manipulates the information encoded in a string subsequence kernel. SSK weights the substrings according to their proximity in the text. Higher values of λ place more weight on non-contiguous substrings and vice versa. In other words this is the parameter that controls the penalisation of the interior gaps in the substrings. SSK was compared to NGK and WK. We once again evaluate the performance of these techniques by averaging the results over 10 runs of the algorithm. A series of experiments was conducted to study the performance of a text-categorization system based on SSK by widely varying the weight decay parameter. The results of this set of experiments are described in Tables 3 and 4. The average F1, precision and recall are given, and note that these tables also show the standard deviations. The value of k was set to 5. It was a difficult choice, since as shown in the preceding section different categories obtained their highest value


Category  Kernel  λ     F1            Precision     Recall
                        Mean   SD     Mean   SD     Mean   SD
earn      NGK     0     0.944  0.026  0.992  0.013  0.903  0.051
          SSK     0.01  0.946  0.028  0.992  0.013  0.905  0.052
                  0.03  0.946  0.028  0.992  0.013  0.905  0.052
                  0.05  0.944  0.026  0.992  0.013  0.903  0.051
                  0.07  0.944  0.026  0.992  0.013  0.903  0.051
                  0.09  0.944  0.026  0.992  0.013  0.902  0.051
                  0.1   0.944  0.026  0.992  0.013  0.903  0.051
                  0.3   0.943  0.029  0.992  0.013  0.900  0.055
                  0.5   0.936  0.013  0.992  0.014  0.888  0.067
                  0.7   0.928  0.040  0.994  0.012  0.873  0.062
                  0.9   0.914  0.050  0.989  0.020  0.853  0.075
          WK      -     0.925  0.014  0.989  0.014  0.867  0.057
acq       NGK     0     0.882  0.038  0.912  0.041  0.856  0.051
          SSK     0.01  0.873  0.040  0.910  0.040  0.840  0.050
                  0.03  0.878  0.040  0.908  0.040  0.852  0.057
                  0.05  0.882  0.037  0.912  0.040  0.856  0.054
                  0.07  0.873  0.044  0.910  0.041  0.840  0.063
                  0.09  0.863  0.043  0.908  0.041  0.824  0.063
                  0.1   0.871  0.043  0.903  0.038  0.844  0.069
                  0.3   0.870  0.040  0.911  0.051  0.836  0.061
                  0.5   0.867  0.038  0.914  0.042  0.828  0.067
                  0.7   0.805  0.050  0.935  0.046  0.712  0.078
                  0.9   0.735  0.073  0.850  0.064  0.652  0.092
          WK      -     0.802  0.033  0.843  0.067  0.768  0.057

Table 3: The performance (F1, precision, recall) of SVM with SSK, NGK and WK for Reuters categories earn and acq. Results illustrate the impact of varying λ on the performance of SSK. The results are averaged over 10 runs of the techniques. We also report standard deviation.


Category  Kernel  λ     F1            Precision     Recall
                        Mean   SD     Mean   SD     Mean   SD
crude     NGK     0     0.937  0.048  0.968  0.045  0.913  0.083
          SSK     0.01  0.937  0.048  0.968  0.045  0.913  0.083
                  0.03  0.941  0.041  0.968  0.045  0.920  0.069
                  0.05  0.945  0.041  0.974  0.044  0.920  0.069
                  0.07  0.945  0.041  0.974  0.044  0.920  0.069
                  0.09  0.927  0.052  0.987  0.027  0.880  0.098
                  0.1   0.947  0.039  0.980  0.032  0.920  0.069
                  0.3   0.948  0.030  0.980  0.032  0.920  0.052
                  0.5   0.936  0.045  0.979  0.033  0.900  0.078
                  0.7   0.893  0.363  0.993  0.022  0.813  0.062
                  0.9   0.758  0.861  0.810  0.134  0.727  0.106
          WK      -     0.904  0.043  0.910  0.082  0.907  0.064
corn      NGK     0     0.847  0.103  0.912  0.092  0.800  0.141
          SSK     0.01  0.845  0.098  0.920  0.081  0.790  0.137
                  0.03  0.845  0.098  0.920  0.081  0.790  0.137
                  0.05  0.834  0.086  0.921  0.081  0.780  0.123
                  0.07  0.827  0.088  0.920  0.081  0.760  0.126
                  0.09  0.834  0.083  0.920  0.081  0.770  0.116
                  0.1   0.827  0.088  0.920  0.081  0.760  0.126
                  0.3   0.825  0.087  0.931  0.084  0.750  0.127
                  0.5   0.779  0.104  0.886  0.094  0.700  0.125
                  0.7   0.628  0.109  0.861  0.088  0.510  0.137
                  0.9   0.348  0.185  0.824  0.238  0.240  0.165
          WK      -     0.762  0.099  0.833  0.065  0.710  0.137

Table 4: The performance (F1, precision, recall) of SVM with SSK, NGK and WK for Reuters categories crude and corn. Results illustrate the impact of varying λ on the performance of SSK. The results are averaged over 10 runs of the techniques. We also report standard deviation.


of F1 at different lengths. However, the main objective of the experiments described in this section was to analyse the behaviour of SSK by varying λ.

It is interesting to note that precision peaks at a higher value (λ = 0.7) for all the categories except one (corn). For corn the peak is achieved at λ = 0.3, and note that once a category has achieved its maximum value, a further increase in the value of λ can degrade the effectiveness of the system substantially. Furthermore, the gain in precision is substantial for most of the categories. The gain in recall is not obtained at higher values of λ; the peak is obtained at a low value (λ = 0.03) for all categories except one, which achieves its peak at a slightly higher value (λ = 0.05). We also note that for higher values of λ there is a substantial loss in recall. We now briefly analyse the improvement in F1 numbers with varying λ. It seems that the F1 numbers reach a maximum at a value that is not very high and then fall to a minimum at the highest value of λ.

Polysemy is a characteristic of the English language. It seems that SSK deals with this problem, as SSK returns a high similarity score if the documents share many non-contiguous substrings. A text-categorization system based on SSK can correctly classify documents that share the same, but semantically different, words. This phenomenon is evident from the results.

We now compare the performance of SSK with the other techniques. Note that the length of the n-grams was set to 5 for this comparison. The results show that the effectiveness of an SVM classifier in conjunction with SSK is as good as the generalisation performance of an SVM classifier in conjunction with NGK. The results also show that the performance of SSK can be better than NGK in some cases, though the gain in performance does not seem substantial. It is worth noting that SSK is able to achieve higher values of precision when compared to NGK.

4.3 Effectiveness of Combining Kernels

As in the preceding sections, here we describe a series of experiments to study the choice of a kernel. We observe the influence of combining kernels on the generalisation performance of an SVM classifier. In other words, we empirically study the effect of adding the respective inner products for different subsequence lengths and weights. The text collection and evaluation measures remain the same for these experiments.

Combining Kernels of Different Lengths
The first set of experiments considered a kernel matrix with entries that are the sums of the respective entries of string subsequence kernels of different lengths. More formally,

K = (K(d_i, d_j))_{1≤i,j≤n} = (K_1(d_i, d_j) + K_2(d_i, d_j))_{1≤i,j≤n} = K_1 + K_2,

where K_1 is the string subsequence kernel matrix for length k_1 and K_2 is for length k_2. The value of the weight decay parameter λ was set to 0.5 for this set of experiments. Kernels for lengths 3 and 4, 4 and 5, and 5 and 6 were combined. The results are reported in Table 5. For illustration the results for lengths 3, 4, 5, and 6 are also given. The results show that this technique of combining kernels has the potential to improve the performance of a system. The performance of an SVM with a combined SSK can be better than the performance of an SVM with either of the individual kernels. This is evident from the value of F1 for a combination of lengths 3 and 4 for the category “crude”. However in some scenarios


Category  k1  k2  F1            Precision     Recall
                  Mean   SD     Mean   SD     Mean   SD
earn      3   0   0.925  0.036  0.981  0.030  0.878  0.057
          4   0   0.932  0.029  0.992  0.013  0.888  0.052
          5   0   0.936  0.036  0.992  0.013  0.888  0.067
          6   0   0.936  0.033  0.992  0.013  0.888  0.060
          3   4   0.935  0.029  0.981  0.024  0.895  0.052
          4   5   0.937  0.030  0.992  0.013  0.890  0.056
          5   6   0.938  0.034  0.992  0.013  0.893  0.062
acq       3   0   0.785  0.040  0.863  0.060  0.724  0.064
          4   0   0.822  0.047  0.898  0.045  0.760  0.068
          5   0   0.867  0.038  0.914  0.042  0.828  0.057
          6   0   0.876  0.051  0.934  0.043  0.828  0.080
          3   4   0.827  0.028  0.866  0.034  0.792  0.037
          4   5   0.857  0.036  0.918  0.027  0.804  0.051
          5   6   0.866  0.044  0.925  0.043  0.816  0.066
crude     3   0   0.881  0.077  0.931  0.101  0.853  0.129
          4   0   0.905  0.090  0.980  0.032  0.853  0.143
          5   0   0.936  0.045  0.979  0.033  0.900  0.078
          6   0   0.901  0.051  0.990  0.031  0.834  0.079
          3   4   0.932  0.048  0.958  0.071  0.913  0.070
          4   5   0.936  0.049  0.981  0.042  0.90   0.090
          5   6   0.916  0.062  0.986  0.03   0.86   0.101
corn      3   0   0.665  0.169  0.940  0.077  0.540  0.190
          4   0   0.783  0.103  0.924  0.086  0.690  0.137
          5   0   0.779  0.104  0.886  0.094  0.700  0.125
          6   0   0.749  0.096  0.919  0.098  0.640  0.117
          3   4   0.769  0.080  0.904  0.092  0.680  0.113
          4   5   0.776  0.090  0.904  0.093  0.69   0.120
          5   6   0.761  0.080  0.908  0.088  0.660  0.966

Table 5: The performance (F1, precision, recall) of SVM with combined kernels for Reuters categories earn, acq, corn and crude. SSKs for different lengths have been combined. The results are averaged and the standard deviation is also given.


the combination of kernels appears to give no gain, showing that both kernels provide similar information. Note that all the results presented in this section are averaged over 10 samples of the data.

Combining NGK and SSK
We now present the experiments carried out by adding the respective entries of a string subsequence kernel matrix and an n-grams kernel matrix. We set the length for both kernels to 5, and for SSK the value of λ was set to 0.5. The results of this set of experiments are shown in Table 6. We not only combined SSK and NGK but also observed the influence of combining weighted entries of the respective kernel matrices. The entries of NGK were weighted more heavily than those of SSK; formally

K = w_ng NGK + w_sk SSK.

Unfortunately this set of experiments did not yield any improvement in the generalisation performance of an SVM classifier.

Combining SSK with Different λ's
Another set of experiments was conducted to evaluate the effect of combining SSKs with different λ's. The results are reported in Table 7. The length of the subsequence was set to 5 and two values of λ (0.05 and 0.5) were combined. The results showed that this methodology of adding the respective entries of string subsequence kernels for different λ's and using the resultant kernel matrix with an SVM does not improve the performance of the system substantially.
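All of these combinations act entrywise on precomputed Gram matrices, so they reduce to simple matrix sums; a minimal sketch (ours, with made-up toy matrices):

    import numpy as np

    def combine_lengths(K1, K2):
        # Sum of SSK Gram matrices for two subsequence lengths: K = K1 + K2
        return K1 + K2

    def combine_weighted(K_ngk, K_ssk, w_ng=0.7, w_sk=0.3):
        # Weighted combination K = w_ng * NGK + w_sk * SSK
        return w_ng * K_ngk + w_sk * K_ssk

    # toy 2 x 2 Gram matrices standing in for kernels computed on two documents
    K_a = np.array([[1.0, 0.3], [0.3, 1.0]])
    K_b = np.array([[1.0, 0.2], [0.2, 1.0]])
    print(combine_lengths(K_a, K_b))
    print(combine_weighted(K_a, K_b, w_ng=0.6, w_sk=0.4))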

5. Approximating Kernels

When constructing a Gram matrix, the computational cost is often high. This may be due either to the need for a large number of kernel evaluations (i.e. there is a large training set) or to the high computational cost of evaluating the kernel itself. In some circumstances, both points may be true.

The approximation approach we adopt is based on a more general empirical kernel map introduced by Schölkopf et al. (1999). We consider the special case when the set of vectors is chosen to be orthogonal. Assume we have some training points (x_i, y_i) ∈ X × Y, and some kernel function K(x, z) corresponding to a feature space mapping φ : X → F such that K(x, z) = ⟨φ(x), φ(z)⟩. Consider a set S of vectors S = {s_i ∈ X}. If the cardinality of S is equal to the dimensionality of the space F and the vectors φ(s_i) are orthogonal (i.e. K(s_i, s_j) = Cδ_ij)¹, then the following is true:

K(x, z) = (1/C) Σ_{s_i ∈ S} K(x, s_i) K(z, s_i).    (3)

1. Where δ_ij = 1 if i = j and 0 otherwise.


Category  wng  wsk  F1            Precision     Recall
                    Mean   SD     Mean   SD     Mean   SD
earn      1    0    0.944  0.026  0.992  0.013  0.903  0.051
          0    1    0.936  0.013  0.992  0.014  0.888  0.067
          0.5  0.5  0.944  0.026  0.992  0.013  0.903  0.051
          0.6  0.4  0.941  0.029  0.992  0.013  0.898  0.055
          0.7  0.3  0.941  0.029  0.992  0.013  0.898  0.055
          0.8  0.2  0.944  0.030  0.992  0.013  0.903  0.060
          0.9  0.1  0.943  0.026  0.992  0.013  0.900  0.050
acq       1    0    0.882  0.038  0.912  0.041  0.856  0.051
          0    1    0.867  0.038  0.914  0.042  0.828  0.067
          0.5  0.5  0.865  0.035  0.917  0.045  0.820  0.051
          0.6  0.4  0.878  0.031  0.908  0.036  0.852  0.046
          0.7  0.3  0.868  0.047  0.909  0.040  0.832  0.073
          0.8  0.2  0.875  0.033  0.913  0.050  0.844  0.058
          0.9  0.1  0.875  0.040  0.910  0.040  0.844  0.058
crude     1    0    0.937  0.048  0.968  0.045  0.913  0.083
          0    1    0.936  0.045  0.979  0.033  0.900  0.078
          0.5  0.5  0.940  0.036  0.987  0.027  0.900  0.072
          0.6  0.4  0.937  0.040  0.980  0.032  0.900  0.072
          0.7  0.3  0.934  0.057  0.994  0.020  0.887  0.105
          0.8  0.2  0.928  0.057  0.981  0.042  0.887  0.104
          0.9  0.1  0.926  0.063  1.000  0.000  0.867  0.104
corn      1    0    0.847  0.103  0.912  0.092  0.800  0.141
          0    1    0.779  0.104  0.886  0.094  0.700  0.125
          0.5  0.5  0.843  0.095  0.929  0.085  0.750  0.127
          0.6  0.4  0.836  0.091  0.943  0.085  0.760  0.126
          0.7  0.3  0.827  0.098  0.916  0.083  0.760  0.126
          0.8  0.2  0.831  0.093  0.930  0.084  0.760  0.126
          0.9  0.1  0.849  0.092  0.943  0.084  0.780  0.123

Table 6: The performance (F1, precision, recall) of SVM with combined kernels for Reuters categories earn, acq, corn and crude. SSK and NGK are combined. The results are averaged and the standard deviation is also given.


Category  λ1    λ2   F1            Precision     Recall
                     Mean   SD     Mean   SD     Mean   SD
earn      0.05  0.0  0.944  0.026  0.992  0.013  0.903  0.051
          0.5   0.0  0.936  0.013  0.992  0.014  0.888  0.067
          0.05  0.5  0.949  0.024  0.991  0.013  0.911  0.044
acq       0.05  0.0  0.882  0.037  0.912  0.040  0.856  0.054
          0.5   0.0  0.867  0.038  0.914  0.042  0.828  0.067
          0.05  0.5  0.869  0.041  0.921  0.042  0.824  0.063
crude     0.05  0.0  0.945  0.041  0.974  0.044  0.920  0.069
          0.5   0.0  0.936  0.045  0.979  0.033  0.900  0.078
          0.05  0.5  0.940  0.039  0.980  0.032  0.907  0.071
corn      0.05  0.0  0.834  0.086  0.921  0.081  0.780  0.123
          0.5   0.0  0.779  0.104  0.886  0.094  0.700  0.125
          0.05  0.5  0.818  0.088  0.930  0.085  0.740  0.126

Table 7: The performance (F1, precision, recall) of SVM with combined kernels for Reuters categories earn, acq, corn and crude. SSKs with different λ's have been combined. The results are averaged and the standard deviation is also given.

This follows from the fact that φ(x) = (1/C) Σ_{s_i ∈ S} K(x, s_i) φ(s_i) and φ(z) = (1/C) Σ_{s_j ∈ S} K(z, s_j) φ(s_j), so that

K(x, z) = (1/C²) Σ_{s_i, s_j ∈ S} K(x, s_i) K(z, s_j) C δ_ij = (1/C) Σ_{s_i ∈ S} K(x, s_i) K(z, s_i).

If, instead of forming a complete orthonormal basis, the cardinality of S̃ ⊆ S is less than the dimensionality of F, or the vectors s_i are not fully orthogonal, then we can construct an approximation to the kernel K:

K(x, z) ≈ Σ_{s_i ∈ S̃} K(x, s_i) K(z, s_i).    (4)

In this paper we propose to use the above fact in conjunction with an efficient method of choosing S to construct a good approximation of the kernel function. If the set S is carefully constructed, then the production of a Gram matrix which is closely aligned to the true Gram matrix can be achieved with a fraction of the computational cost. A problem at this stage is how to choose the set S and how to ensure that the vectors φ(s_i) are orthogonal.
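Equation (4) lends itself to a direct implementation once a basis set has been fixed: each document is mapped to its vector of kernel evaluations against the basis, and the approximate Gram matrix is the linear kernel on those vectors. A sketch of ours, with a toy stand-in kernel so the example runs (in the paper's setting the kernel would be the SSK evaluated against short strings):

    import numpy as np

    def approximate_gram(docs, basis, kernel):
        # Equation (4): K(x, z) ≈ sum over s_i in the basis of K(x, s_i) * K(z, s_i).
        # Each document is mapped to its vector of (cheap) kernel evaluations against the basis.
        F = np.array([[kernel(d, s) for s in basis] for d in docs])
        return F @ F.T

    # toy kernel: counts occurrences of the short string s in document d
    toy_kernel = lambda d, s: float(d.count(s))
    docs = ["the cat sat on the mat", "crude oil prices rose", "the bat sat on the hat"]
    basis = ["the", "oil", "sat"]
    print(approximate_gram(docs, basis, toy_kernel))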

5.1 Choosing a subset of features

There are many possible ways to choose a set S. Heuristics may include simply selecting a random subset as in Williams and Seeger (2001), or listing all possible features and then


selecting the top few according to frequency. Recently the Gram-Schmidt procedure has also been applied to kernel matrices in order to choose orthogonal features (Smola and Schölkopf, 2000). Other approaches may include selecting data points from the training set which are close to being orthonormal, or using a generative model to form S. Whichever method is chosen, the result is a low rank approximation of the Gram matrix K. This is also the aim of techniques such as kernel principal components analysis or latent semantic kernels.

We are going to use a heuristic based on explicitly generating an orthogonal and complete set of data, and from these choosing the best points according to a given criterion, so S̃ ⊂ S where S is orthogonal and complete. Suppose that the set S̃ has size l with each entry having n′ characters. In this case the evaluation of the feature vector will require O(nn′lt) operations for the string kernel of length n and a document of length t. The computation of the approximate string kernel would therefore require O(nn′lt) as compared with the O(nt²) required by the direct method. Provided n′l < t this will represent a saving. The improvement is greater if we evaluate a kernel matrix for a training set of m documents, each of length t, since this requires O(mnn′lt + lm²) as opposed to the O(m²nt²) required for a direct evaluation of all the entries. In this case savings will be made if l < nt² and n′l < mt. We should therefore choose n′ small and control the size of l. We enforce both of these inequalities in Section 6 when using the string kernel on the Reuters dataset. Before we discuss a particular implementation of this method however, we must first discuss a method of measuring the similarity of two Gram matrices.

5.2 Similarity of Gram Matrices

In order to discover how many features are needed to obtain a good approximation of the true Gram matrix, we need a measure of similarity between Gram matrices. In this paper we use the recently proposed notion of alignment (Cristianini et al., to appear). In that paper the Frobenius inner product is used between Gram matrices, ⟨K_1, K_2⟩_F = Σ_{i,j=1}^{m} K_1(x_i, x_j) K_2(x_i, x_j).
The measure of alignment between two matrices is then given as follows:

Definition 3 (Alignment) The (empirical) alignment of a kernel k_1 with a kernel k_2 with respect to the sample S is the quantity

A(S, k_1, k_2) = ⟨K_1, K_2⟩_F / √(⟨K_1, K_1⟩_F ⟨K_2, K_2⟩_F),

where K_i is the kernel matrix for the sample S using kernel k_i.

This is the measure which we will use here. If the matrices are fully aligned (i.e. they are the same) then the alignment measure is 1; see (Cristianini et al., to appear) or (Cristianini et al., 2001) for details of kernel alignment.
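The alignment measure is straightforward to compute on Gram matrices; a small sketch of ours:

    import numpy as np

    def alignment(K1, K2):
        # A(S, k1, k2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)
        frob = lambda A, B: np.sum(A * B)          # Frobenius inner product
        return frob(K1, K2) / np.sqrt(frob(K1, K1) * frob(K2, K2))

    K_true = np.array([[1.0, 0.40], [0.40, 1.0]])
    K_approx = np.array([[1.0, 0.35], [0.35, 1.0]])
    print(alignment(K_true, K_approx))             # close to 1 for well-matched matrices
    print(alignment(K_true, K_true))               # exactly 1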

6. Approximating the String Kernel

In this section we show the usefulness of the technique by approximating the SSK. The high computational cost of SSK makes it a good candidate for our approach.


6.1 Obtaining the approximation

As mentioned above, the string kernel has time complexity O(n|s||t|), where n is the length of the subsequences which we are considering, and |s| and |t| are the lengths of the documents involved. For datasets such as the Reuters dataset, which contains approximately 9600 training examples and 3200 test examples with an average length of approximately 2300 characters, the string kernel is too expensive to apply on large text collections.

Our heuristic for obtaining the set S is as follows:

1. We choose a substring size n.

2. We enumerate all possible contiguous strings of length n.

3. We choose the x strings of length n which occur most frequently in the dataset, and this forms our set S.

Notice that by definition all such strings of length n are orthogonal (i.e. K(s_i, s_j) = Cδ_ij for some constant C) when used in conjunction with the string kernel of degree n. Using all the features is therefore exactly equivalent to the string kernel, whereas using a subset gives an approximation. This is far quicker than calculating the dot product between documents. We would expect the most frequent features to result in a good approximation of the kernel, as we are discarding the less informative features. It is possible that some of the frequent features may be non-informative however, and a possibility for improving this naive approach would be to use mutual information as part of the selection process.
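The frequency-based heuristic can be sketched in a few lines of Python (ours; the alphabet restriction and the toy corpus are assumptions of the example):

    from collections import Counter

    def top_ngram_features(docs, n, x, alphabet="abcdefghijklmnopqrstuvwxyz "):
        # Choose the x contiguous length-n strings that occur most often in the corpus.
        counts = Counter()
        for d in docs:
            counts.update(d[i:i + n] for i in range(len(d) - n + 1))
        valid = Counter({g: c for g, c in counts.items() if all(ch in alphabet for ch in g)})
        return [g for g, _ in valid.most_common(x)]

    docs = ["the cat sat on the mat", "the bat sat on the hat"]
    print(top_ngram_features(docs, n=3, x=5))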

6.2 Selecting the subset

To show how the string kernel can be approximated, and to decide on the number of features which need to be used, we conducted the following experiment. First of all we generated all possible 27³ = 19683 3-grams (26 letters and 1 space) and computed the k = 3 string kernel between each of these 3-grams and the first 100 documents in the Reuters dataset. Note that each of these kernel evaluations is very cheap, as one of the “documents” has only 3 characters. We then calculated the Gram matrix for the 100 documents using these features and a linear dot product, and also using the full string kernel. The alignment between the two matrices was 1, empirically confirming that the two are equivalent. We then extracted all contiguous 3-grams present in the first 100 documents (there are 10099) and computed the alignment using these features; the result was 0.999989, which is very close to complete alignment. We then computed the alignment when using the top 10000-200 features (in steps of 200) and then from 200-5 features in steps of 5. In order to obtain a comparison we repeated the experiment using the most infrequent 3-grams, and features selected at random. The results are shown in Figure 1. As can be seen from the graph, only a small number of features are required to generate a good alignment. By simply using the top 5 features we obtain an alignment score of 0.966, whereas the top 200 features give a score of 0.992. Even when using the most infrequent features in the database, the alignment score still rapidly approaches 1 as we increase the number of features, and as expected using random features places us in between these two results. We have therefore shown that even by using a small number of features (< 200) we can achieve a good approximation of the


[Figure 1 appears here: alignment score (vertical axis, 0.5 to 1) plotted against the number of selected features (horizontal axis, up to 10000), with one curve each for the most frequent, most infrequent and randomly selected features.]

Figure 1: Alignment scores (against the Gram matrix generated by the full string kernel) when using the most frequent, infrequent and random selection of features.

The full Gram matrices for large datasets can now be efficiently generated and used with any kernel algorithm.
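A minimal sketch of this alignment computation follows (using numpy; the names K_full, phi and K_approx are placeholders for the quantities described above, not identifiers from the original experiments):

    import numpy as np

    def alignment(K1, K2):
        # Empirical alignment: the Frobenius inner product of the two Gram
        # matrices, normalised by their Frobenius norms.
        return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

    # K_full:   Gram matrix of the exact k = 3 string kernel on the 100 documents
    # phi:      rows are the truncated feature vectors over the selected n-grams
    # K_approx = phi @ phi.T               # linear dot product on those features
    # alignment(K_approx, K_full)          # approaches 1 as features are added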

7. Experimental Results

In order to evaluate the proposed approach we conducted experiments based on the "ModApte" split of the Reuters-21578 dataset. The only pre-processing applied to the dataset before experimentation was the removal of common stop words. We ran an approximation of the string kernel with k = 3, 4, 5 on the full Reuters dataset. In order to compare our results with current techniques, we used a bag-of-words approach, and we used the SVM^light software (Joachims, 1999) to run the experiments once the bag-of-words and approximation features had been generated. The performance of the proposed technique was also compared with n-grams; for the n-grams the stop words were likewise removed as pre-processing, and we conducted experiments for n = 3, 4, 5. Note that for all the experiments described in this section we used the SVM^light package.
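The experiments below were run with SVM^light on pre-computed feature vectors. Purely as an illustration of the word-feature baseline, the following sketch uses scikit-learn as a modern stand-in; it is not the software, and may not be the exact term weighting, used in the original experiments.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    def bow_baseline(train_docs, train_labels, test_docs, test_labels):
        # Bag-of-words baseline: word features with English stop-word removal
        # (mirroring the pre-processing described above) and a linear SVM.
        vec = TfidfVectorizer(stop_words="english")
        clf = LinearSVC().fit(vec.fit_transform(train_docs), train_labels)
        return f1_score(test_labels, clf.predict(vec.transform(test_docs)))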

Table 8 summarises the results obtained from the preliminary experiments; the top row of the table indicates the number of features used in the approximation. Note that the preliminary experiments were conducted for k = 5. For the ship category, the number of features used in the approximation has a direct effect on the generalisation error, and the corn category shows similar behaviour. For the earn and acq categories, however, the results are stable. This suggests the possibility of using a heuristic which increases the number of features until the increase in alignment becomes negligible; the relationship between the quality of alignment and generalisation error could then be investigated to see whether a good correlation exists. For our next set of experiments we set the number of features to 3000. The results of this set of experiments are given in Table 9, where F1 numbers are reported for all three techniques.


Category    1000 features    3000 features
earn        0.97             0.97
acq         0.88             0.85
ship        0.10             0.53
corn        0.15             0.65

Table 8: F1 numbers for 4 Reuters categories, comparing different numbers of features in the approximation to the SSK.

Category      WK      NGK (n=3)  NGK (n=4)  NGK (n=5)   Approx. SSK (k=3)  (k=4)   (k=5)
earn          0.982   0.982      0.984      0.974       0.970              0.970   0.970
acq           0.948   0.919      0.932      0.887       0.850              0.880   0.880
money-fx      0.775   0.744      0.757      0.692       0.700              0.760   0.740
grain         0.930   0.852      0.840      0.758       0.800              0.820   0.800
crude         0.880   0.857      0.848      0.640       0.820              0.840   0.840
trade         0.761   0.774      0.779      0.767       0.700              0.730   0.730
interest      0.691   0.708      0.719      0.503       0.630              0.660   0.690
ship          0.797   0.657      0.626      0.321       0.610              0.650   0.530
wheat         0.870   0.803      0.797      0.629       0.780              0.790   0.820
corn          0.895   0.761      0.610      0.459       0.630              0.630   0.680

Table 9: F1 numbers for SVM with WK, NGK and the approximated SSK for the top ten Reuters categories.

Table 9 shows that the results obtained with the approximation are comparable to those of WK and NGK. In order to gauge the efficiency of the approach, recall that the string kernel takes O(n|s||t|) time. Using n = 5 and an approximate document length of 2000 characters, with a training set of 9603 documents (as in the Reuters dataset), naively generating the Gram matrix requires roughly 9603 × 9603 × 5 × 2000 × 2000 ≈ 1.8 × 10^15 operations. For the approximation approach, however, using 100 features (i.e. strings of length 3) we have 9603 × 100 × 5 × 3 × 2000 ≈ 2.9 × 10^10 operations, which is considerably faster.
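These operation counts can be checked with a line or two of arithmetic (a rough estimate only; constant factors and the cost model of a single kernel evaluation are ignored):

    # Exact SSK Gram matrix: all training-document pairs, O(n|s||t|) work per pair.
    naive = 9603 * 9603 * 5 * 2000 * 2000
    # Approximation: one cheap evaluation per (document, selected 3-gram) pair.
    approx = 9603 * 100 * 5 * 3 * 2000
    print(f"naive ~ {naive:.1e}, approx ~ {approx:.1e}, ratio ~ {naive / approx:.0f}")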

8. Conclusions

The paper has presented a novel kernel and its approximation for text analysis. The performance of the string subsequence kernel was empirically tested by applying it to a text categorization task. This kernel can be used with any kernel-based learning system, for example in clustering, categorization, ranking, etc. In this paper we have focused on text categorization, using a Support Vector Machine.

Although the kernel does not incorporate any knowledge of the language being used (apart from in the removal of stop words), it does capture semantic information, to the extent that it can outperform state-of-the-art systems on some data.


This paper builds on preliminary results presented in Lodhi et al. (2001). For a given sequence length k the features are indexed by all strings of length k, so direct computation of all the relevant features would be impractical even for moderately sized texts and values of k, owing to the extreme computational cost of accessing such a feature space without using the kernel trick. We have therefore used a dynamic programming style computation to evaluate the kernel directly from the input sequences, without explicitly calculating the feature vectors; this has been possible because we have used a kernel-based learning machine.

The experiments indicate that our algorithm can provide an effective alternative to the more standard word-feature based kernel used in previous SVM applications for text classification (Joachims, 1998). We were also able to compare these results with those obtained using a string kernel that only considers contiguous strings, corresponding to the limit of varying the parameter λ towards 0. The kernel for contiguous strings was also combined with the non-contiguous kernel, to examine to what extent different features were being used in the classification.

In addition, different lengths of strings were considered, and a comparison was made of the results obtained on the Reuters data using a range of different values.

In order to apply the proposed approach to large datasets, we derived a fast algorithm that approximates the exact string kernel. We have also introduced a method for approximating kernels based on using a subset of orthogonal features, which allows the fast construction of kernel Gram matrices. We have shown that using all the features recovers the string kernel solution exactly, and that good approximations can be obtained using a relatively small number of features. To illustrate this technique we conducted an extensive set of experiments. We provided experimental results using the string kernel, which is known to be computationally expensive; using the approximation, however, we were able to achieve results on the full Reuters dataset which were comparable to those produced by the bag-of-words approach.

The results on the full Reuters dataset were, however, less encouraging. In most cases the word kernel and the contiguous n-grams kernel outperformed the string kernel. This led us to conjecture that the excellent results on smaller datasets demonstrate that the kernel is performing something similar to stemming, hence providing semantic links between words that the word kernel must view as distinct. This effect is no longer so important in large datasets, where there is enough data to learn the relevance of the terms.

It is an open question whether ranking the features according to positive and negative examples and introducing a weighting scheme would improve results. Also, once a good approximation to the Gram matrix is produced, it can easily be used with other kernel methods (e.g. for clustering, principal components analysis, etc.). The relationship between the quality of the approximation and the generalisation error achieved by an algorithm also needs to be explored. One could further consider using the fast approximation with very few features to obtain a very coarse Gram matrix; a preliminary estimate of the support vectors could then be obtained, and the entries for these vectors refined. This could give rise to a fast form of chunking for large datasets.

The paper has provided a fairly thorough testing of the use of string kernels for text data, in particular considering the effect of varying the subsequence lengths, the value of the decay parameter, and combinations of different kernels. It has also developed an approximation strategy that enables one to apply the approach to large datasets. Future work will consider the extension of the techniques to strings of syllables and words, as well as to types of data other than text.



References

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.

W. B. Cavnar. Using an n-gram based document representation with a vector processing retrieval model. In D. K. Harman, editor, Proceedings of TREC-3, 3rd Text Retrieval Conference, pages 269–278, Gaithersburg, Maryland, US, 1994. National Institute of Standards and Technology, Gaithersburg, US. http://trec.nist.gov/pubs/trec3/t3proceedings.html.

N. Cristianini, A. Elisseef, and J. Shawe-Taylor. On optimizing kernel alignment. Technical Report NC-TR-01-087, Royal Holloway, University of London, UK, 2001.

N. Cristianini, A. Elisseef, and J. Shawe-Taylor. On kernel-target alignment. In Neural Information Processing Systems (NIPS '01), to appear.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.

T. Friess, N. Cristianini, and C. Campbell. The kernel-adatron: a fast and simple training procedure for support vector machines. In J. Shavlik, editor, Proceedings of the 15th International Conference on Machine Learning. Morgan Kaufmann, 1998.

D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California in Santa Cruz, Computer Science Department, July 1999.

S. Huffman. Acquaintance: Language-independent document categorization by n-grams. In D. K. Harman and E. M. Voorhees, editors, Proceedings of TREC-4, 4th Text Retrieval Conference, pages 359–371, Gaithersburg, Maryland, US, 1995. National Institute of Standards and Technology, Gaithersburg, US. http://trec.nist.gov/pubs/trec4/t3proceedings.html.

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Claire Nedellec and Celine Rouveirol, editors, Proceedings of the European Conference on Machine Learning, pages 137–142, Berlin, 1998. Springer.

T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.

H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, pages 563–569, Cambridge, MA, 2001. MIT Press.


J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society London (A), 209:415–446, 1909.

G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München, 1997. Doktorarbeit, Technische Universität Berlin. Available from http://www.kyb.tuebingen.mpg.de/~bs.

B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, 1999.

A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In P. Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning, pages 911–918. Morgan Kaufmann, 2000.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge, MA, 2000. MIT Press.

C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, pages 682–688, Cambridge, MA, 2001. MIT Press.
