The Limitation of the SVD for Latent Semantic Indexing

Andri Mirzal
Faculty of Computing, N28-439-03

Universiti Teknologi Malaysia
81310 UTM Johor Bahru, Malaysia

Email: [email protected]

Abstract—Latent semantic indexing (LSI) is an indexing method for improving the retrieval performance of an information retrieval system by grouping related documents into the same clusters, so that each of these documents indexes the same (or almost the same) words and unrelated documents index (relatively) different words. The de facto standard method for LSI is the truncated singular value decomposition (SVD). In this paper, we show that the LSI capability of the truncated SVD is not as conclusive as previously reported; rather, it is conditional, and when the condition is not met the truncated SVD can fail to recognize the related documents, resulting in poor retrieval performance.

Keywords—document analysis, information retrieval, latent semantic indexing, singular value decomposition

I. INTRODUCTION

Latent semantic indexing (LSI) is a technique for improving the retrieval performance of an information retrieval (IR) system by grouping related documents into the same clusters, so that each of these documents indexes the same (or almost the same) words and unrelated documents index (relatively) different words. LSI was introduced by Deerwester et al. [1], and an extensive review can be found in ref. [2]. Technically, LSI is conducted by making every document also index the set of words that appears in all related documents, and by weakening the influence of words that appear in unrelated documents. Hence, a query can retrieve not only documents that index the query words, but also related documents that do not index the query words. In addition, irrelevant documents that contain the query words can be filtered out since their influence is weakened. These two simultaneous mechanisms can sometimes handle the synonym and polysemy problems.

The de facto standard method for LSI is the truncated singular value decomposition (SVD). The truncated SVD can be used for LSI because it transforms the original document space into a lower dimensional subspace in which the distances between related documents shrink and the distances between unrelated documents expand. Even though many works have reported the LSI capability of the truncated SVD, e.g., [1]–[8], to our knowledge there is still a lack of works that demonstrate its limitations.

In this paper, we demonstrate the limitation of the truncated SVD for LSI by presenting cases in which the method fails to identify the correct document clusters and consequently cannot handle the synonym and polysemy problems, resulting in poorer performance than the vector space model [9], the classical IR approach that directly uses the original document space.

II. THE TRUNCATED SVD

The truncated SVD is a reduced rank SVD approximation to a matrix. Some applications of this technique are 1) approximating a matrix [10], 2) computing the pseudoinverse [11], 3) determining the rank, range, and null space of a matrix [12], and 4) clustering [13], [14]. Given a matrix $\mathbf{A} \in \mathbb{C}^{M \times N}$ with $\mathrm{rank}(\mathbf{A}) = r$, the SVD of $\mathbf{A}$ can be written as:

$$\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^{T},$$

where $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_M] \in \mathbb{C}^{M \times M}$ denotes a unitary matrix that contains the left singular vectors, $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_N] \in \mathbb{C}^{N \times N}$ denotes a unitary matrix that contains the right singular vectors, and $\boldsymbol{\Sigma} \in \mathbb{R}_{+}^{M \times N}$ denotes a matrix that contains the singular values along its diagonal, with diagonal entries $\sigma_1 \geq \ldots \geq \sigma_r > \sigma_{r+1} = \ldots = \sigma_{\min(M,N)} = 0$, and zeros otherwise.

The rank-$K$ truncated SVD approximation to $\mathbf{A}$ is defined as:

$$\mathbf{A} \approx \mathbf{A}_K = \mathbf{U}_K \boldsymbol{\Sigma}_K \mathbf{V}_K^{T}, \qquad (1)$$

where $K < r$, $\mathbf{U}_K$ and $\mathbf{V}_K$ contain the first $K$ columns of $\mathbf{U}$ and $\mathbf{V}$ respectively, and $\boldsymbol{\Sigma}_K$ denotes the $K \times K$ principal submatrix of $\boldsymbol{\Sigma}$.
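For illustration, the rank-$K$ truncation in (1) can be computed with standard numerical libraries. The following NumPy sketch (an illustrative addition, not part of the original text) keeps only the leading $K$ singular triplets:

```python
import numpy as np

def truncated_svd(A, K):
    """Rank-K truncated SVD approximation A_K = U_K * Sigma_K * V_K^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD
    return U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
```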

III. SUPPORTIVE EXAMPLES

As noted in the introduction, the truncated SVD has LSI capability because it can handle the synonym and polysemy problems. In this section, we show how the truncated SVD solves these problems by using synthetic datasets. For this purpose, the datasets were designed so that their structures allow the problems to be resolved.

A. Synonym problem

Synonyms are words with similar or related meanings. For example, university, college, and institute are synonyms. Because they have similar meanings, an IR system is expected to also retrieve documents that contain synonyms of the query words but not the query words themselves. If the vector space model is used, obviously relevant documents that contain only synonyms of the query words will not be retrieved. The example in table I (taken from ref. [5]) describes the synonym problem associated with the vector space model. Note that


TABLE I. THE VECTOR SPACE MODEL FOR DESCRIBING SYNONYMY.

Word      Doc1  Doc2  Doc3  Doc4  Doc5
mark        15     0     0     0     0
twain       15     0    20     0     0
samuel       0    10     5     0     0
clemens      0    20    10     0     0
purple       0     0     0    20    10
colour       0     0     0    15     0

TABLE II. LSI USING THE TRUNCATED SVD FOR MATRIX IN TABLE I.

Word      Doc1  Doc2  Doc3  Doc4  Doc5
mark      3.72  3.50  5.45     0     0
twain     11.0  10.3  16.1     0     0
samuel    4.15  3.90  6.08     0     0
clemens   8.30  7.80  12.2     0     0
purple       0     0     0  21.0  7.08
colour       0     0     0  13.5  4.55

‘Mark Twain’ and ‘Samuel Clemens’ refer to the same person, and ‘purple’ and ‘colour’ are closely related. Thus, the reference classes for the documents are {Doc1, Doc2, Doc3} and {Doc4, Doc5}, and for the words are {mark, twain, samuel, clemens} and {purple, colour}, where the first (second) document class corresponds to the first (second) word class.

In the vector space model, the task of finding documents relevant to a query is conducted by calculating distances (usually with the cosine criterion [2]) between the query vector $\mathbf{q} \in \mathbb{R}_{+}^{M \times 1}$ and the document vectors $\mathbf{a}_n \in \mathbb{R}_{+}^{M \times 1}$, for all $n$. The more relevant a document is to the query, the closer the distance between them. The query vector is analogous to the document vectors; it indexes words that appear both in the query and in the word-by-document matrix $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$. Note that when there are preprocessing steps or the truncated SVD is used to approximate the matrix, the modified/approximate matrix is used instead.

For the dataset in table I, when a query containing ‘mark’ and ‘twain’ is created ($\mathbf{q}^T = [1\ 1\ 0\ 0\ 0\ 0]$), the result will be $[1.00\ 0.00\ 0.62\ 0.00\ 0.00]$ (derived by computing cosine distances between $\mathbf{q}$ and $\mathbf{a}_n$, for all $n$). So, only Doc1 and Doc3 will be retrieved; Doc2, which contains ‘samuel’ and ‘clemens’ (synonyms of ‘mark’ and ‘twain’), will not be recognized as relevant. Similarly, a query containing ‘colour’ but not ‘purple’ will not be able to retrieve Doc4.

According to a result by Kontostathis and Pottenger [6], LSI using the truncated SVD can recognize synonyms as long as there is a short path that chains the synonyms together. For example, in table I, ‘mark’ and ‘twain’ are connected to ‘samuel’ and ‘clemens’ through Doc3, so it can be expected that LSI using the truncated SVD can recognize these synonyms. Similarly, ‘colour’ and ‘purple’ are connected through Doc4, so LSI is also expected to be able to reveal this connection.

Table II shows the rank-2 SVD approximation to the original matrix in table I (the rank was chosen based on the number of reference classes). As shown, Doc1, Doc2, and Doc3 are now indexing ‘mark’, ‘twain’, ‘samuel’, and ‘clemens’, and Doc4 and Doc5 are now indexing ‘purple’ and ‘colour’. Thus, all relevant documents will be correctly retrieved if appropriate queries are made.
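To make the example reproducible, the following NumPy sketch (an illustrative addition, not the authors' code) builds the matrix in table I, scores the query containing ‘mark’ and ‘twain’ against the original columns, and then against the rank-2 approximation, in which Doc2 obtains a nonzero score:

```python
import numpy as np

# Word-by-document matrix from table I (rows: mark, twain, samuel,
# clemens, purple, colour; columns: Doc1..Doc5).
A = np.array([[15,  0,  0,  0,  0],
              [15,  0, 20,  0,  0],
              [ 0, 10,  5,  0,  0],
              [ 0, 20, 10,  0,  0],
              [ 0,  0,  0, 20, 10],
              [ 0,  0,  0, 15,  0]], dtype=float)

def cosine_scores(q, M):
    """Cosine similarity between query vector q and each column of M."""
    norms = np.linalg.norm(M, axis=0) * np.linalg.norm(q)
    return (q @ M) / np.where(norms == 0, 1, norms)

q = np.array([1, 1, 0, 0, 0, 0], dtype=float)  # query: 'mark', 'twain'

# Vector space model: only Doc1 and Doc3 get nonzero scores.
print(np.round(cosine_scores(q, A), 2))

# Rank-2 truncated SVD (rank chosen from the number of reference classes).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
print(np.round(A2, 2))                     # should match table II up to rounding
print(np.round(cosine_scores(q, A2), 2))   # Doc2 now scores above zero
```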

B. Polysemy problem

Polysemy is the problem of a word having multiple meanings that are not necessarily related. Since a polyseme can appear in

TABLE III. THE VECTOR SPACE MODEL FOR DESCRIBING POLYSEMY.

Word      Doc1  Doc2  Doc3  Doc4  Doc5  Doc6
money        1     0     1     0     0     0
bed          0     1     0     1     0     1
river        0     1     0     1     0     0
bank         1     1     1     1     1     1
interest     1     0     1     0     1     0

TABLE IV. LSI USING THE TRUNCATED SVD FOR DATA IN TABLE III.

Word         Doc1     Doc2     Doc3     Doc4     Doc5    Doc6
money       0.809  -0.0550    0.809  -0.0550    0.547  0.0621
bed       -0.0239     1.08  -0.0239     1.08    0.117   0.738
river     -0.0550    0.809  -0.0550    0.809   0.0621   0.547
bank         1.06     1.06     1.06     1.06    0.855   0.855
interest     1.08  -0.0239     1.08  -0.0239    0.738   0.117

unrelated documents, a query containing it will probably also retrieve unrelated documents. Table III gives an example of such a problem, where ‘bank’ refers either to a financial institution or to an area near a river. By inspection, it is clear that the reference classes for the documents are {Doc1, Doc3, Doc5} and {Doc2, Doc4, Doc6}, and the reference classes for the words are {money, bank, interest} and {bed, river, bank}, where the first (second) document class corresponds to the first (second) word class.

If query q1, containing ‘bank’ and ‘money’ (a query corresponding to the first document class), is made to the vector space model in table III, then only Doc1 and Doc3 will be recognized as relevant since the other documents have the same score. Similarly, if query q2, containing ‘river’ and ‘bank’ (a query corresponding to the second document class), is made, then only Doc2 and Doc4 will be retrieved.

Table IV shows the rank-2 SVD approximation to the original matrix in table III. If the same q1 and q2 are applied to this approximate matrix, the result for q1 will be [0.77 0.41 0.77 0.41 0.79 0.51], and for q2 will be [0.41 0.77 0.41 0.77 0.51 0.79]. So, the truncated SVD can handle the polysemy problem in this case. And, as Doc5 now indexes ‘money’ and Doc6 indexes ‘river’, all vertices in the same class are connected to each other in the bipartite graph representation of the approximate matrix. Thus any appropriate query will be able to retrieve all relevant documents (except when the query contains only the polysemous word ‘bank’).
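The same computation can be repeated for this dataset. A brief sketch (again an illustrative addition, assuming the scores quoted above are cosine similarities as in the synonym example):

```python
import numpy as np

# Table III (rows: money, bed, river, bank, interest; columns: Doc1..Doc6).
A = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [1, 0, 1, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]   # approximates table IV

def cosine_scores(q, M):
    return (q @ M) / (np.linalg.norm(M, axis=0) * np.linalg.norm(q))

q1 = np.array([1, 0, 0, 1, 0], dtype=float)  # 'money', 'bank'
q2 = np.array([0, 0, 1, 1, 0], dtype=float)  # 'river', 'bank'
print(np.round(cosine_scores(q1, A2), 2))    # relevant Doc1, Doc3, Doc5 rank highest
print(np.round(cosine_scores(q2, A2), 2))    # relevant Doc2, Doc4, Doc6 rank highest
```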

IV. COUNTER EXAMPLES

The previous examples were designed to demonstrate the LSI capability of the truncated SVD. Certainly, the results are not conclusive, since one can create counter examples to show its limitation. For the synonym problem, the obvious example is to delete Doc3 from the collection, so that the path that previously connected Doc1 to Doc2 disappears. Table V shows the modified dataset of table I where Doc3 and Doc5 are removed from the collection, and table VI shows the rank-2 SVD approximation to this modified dataset. As shown, Mark Twain and Samuel Clemens are no longer recognized as the same person. And since no document is retrieved for queries containing ‘mark’ and/or ‘twain’, the retrieval performance of the truncated SVD is actually worse than that of the vector space model.

The same condition also applies to the polysemy problem. Table VII shows a modified version of the example in table III where we preserve the polysemous status of ‘bank’ as


TABLE V. THE MODIFIED DATASET OF TABLE I.

Word      Doc1  Doc2  Doc4
mark        15     0     0
twain       15     0     0
samuel       0    10     0
clemens      0    20     0
purple       0     0    20
colour       0     0    15

TABLE VI. LSI USING THE TRUNCATED SVD FOR TABLE V.

Word      Doc1  Doc2  Doc4
mark         0     0     0
twain        0     0     0
samuel       0    10     0
clemens      0    20     0
purple       0     0    20
colour       0     0    15

it appears in two different contexts (related to money and to an area near a river). As shown, there is no clue that Doc1 is related to Doc3 and that Doc2 is related to Doc4. If a query containing ‘bank’ and ‘money’ (which should retrieve Doc1 and Doc3) is made to this vector space model, Doc1, Doc2, and Doc3 will be retrieved instead. Similarly, if a query containing ‘river’ and ‘bank’ (which should retrieve Doc2 and Doc4) is made, then Doc1, Doc2, and Doc4 will be retrieved instead. Table VIII shows the rank-2 SVD approximation to the matrix in table VII. If the same queries are made to this matrix, then Doc2, Doc1, and Doc3 will be recognized as relevant to ‘bank’ and ‘money’, and Doc1 and Doc2 to ‘river’ and ‘bank’. So, both the vector space model and the truncated SVD are unable to retrieve the correct results. However, the vector space model outperforms the truncated SVD in this case because it actually retrieves all relevant documents (plus irrelevant ones).
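A short sketch of the synonym counterexample (an illustrative addition, not from the paper) makes the failure explicit: because the three document clusters in table V have disjoint word supports, their columns are mutually orthogonal, and rank-2 truncation discards the weakest cluster (Doc1) entirely.

```python
import numpy as np

# Table V (rows: mark, twain, samuel, clemens, purple, colour;
# columns: Doc1, Doc2, Doc4).
A = np.array([[15,  0,  0],
              [15,  0,  0],
              [ 0, 10,  0],
              [ 0, 20,  0],
              [ 0,  0, 20],
              [ 0,  0, 15]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
print(np.round(A2, 1))
# The 'mark' and 'twain' rows collapse to zero (table VI): the Doc1 cluster
# carries the smallest singular value and is dropped at rank 2, so a query
# on 'mark'/'twain' retrieves nothing from A2.
```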

Thus, from these two counter examples it can be inferred that the LSI capability of the truncated SVD is not as conclusive as it was previously reported [1], [2], [5].

V. EXPERIMENTAL RESULTS

We will now compare the retrieval performances of the truncated SVD and the vector space model using standard datasets in LSI research (the datasets can be downloaded at http://web.eecs.utk.edu/research/lsi/). Table IX summarizes the datasets, where #Documents, #Words, %NNZ, and #Queries denote the number of documents, the number of unique words, the percentage of nonzero entries, and the number of predefined queries made to the corresponding word-by-document matrix.

Each of the text collections shown in table IX comprises three important files. The first file contains the abstracts of the documents, each indexed by a unique identifier; the second file contains the list of queries, each with its unique identifier; and the third file contains a dictionary that maps every query to its manually assigned relevant documents.

The first file is used to construct the word-by-document matrix $\mathbf{A} \in \mathbb{R}_{+}^{M \times N}$. To extract the unique words, the stop words and words shorter than two characters were removed. But we did not employ any stemming and did not remove words that belong to only one document. The reasons are that stemming does not seem to be popular in LSI research, and that removing unique words

TABLE VII. MODIFIED DATASET OF TABLE III.

Word      Doc1  Doc2  Doc3  Doc4
money        0     0     1     0
bed          0     1     0     0
river        0     0     0     1
bank         1     1     0     0
interest     1     0     0     0

TABLE VIII. LSI USING THE TRUNCATED SVD FOR TABLE VII.

Word         Doc1      Doc2      Doc3      Doc4
money    -0.25899   0.25899   0.72696  -0.25365
bed       0.40773   0.59227   0.25899  -0.09036
river     0.09036  -0.09036  -0.25365    0.0885
bank            1         1         0         0
interest  0.59227   0.40773  -0.25899   0.09036

can potentially reduce the recall, since there is a possibility that queries contain these words. After the word-by-document matrix was constructed, we further adjusted the entry weights using a logarithmic scale, i.e., $A_{ij} \leftarrow \log(A_{ij} + 1)$, but did not normalize the columns of the matrix. This is because, based on our preliminary experiments, the logarithmic scale performed better than the raw word frequency, and the normalization had a negative effect on the retrieval performance of both the vector space model and the truncated SVD for all text collections.

The second file is used to construct the query matrix $\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_Q]^T \in \mathbb{R}_{+}^{Q \times M}$, where $Q$ denotes the number of queries (shown in the last row of table IX), $M$ denotes the number of unique words, which is the same as the number of unique words in the corresponding $\mathbf{A}$, and $\mathbf{q}_q$ denotes the $q$-th query vector constructed from the file. So by multiplying $\mathbf{Q}$ with $\mathbf{A}$, one can get a matrix whose rows contain the scores describing how relevant each document is to the corresponding query.
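A minimal sketch of this scoring pipeline (an illustrative addition; the weighting and the score matrix follow the description above, and the helper names are ours):

```python
import numpy as np

def log_weight(A):
    """Entry weighting A_ij <- log(A_ij + 1); columns are not normalized."""
    return np.log(A + 1.0)

def score_matrix(Q, A, K=None):
    """Scores of all documents for all queries: row q of Q @ A.

    If K is given, A is first replaced by its rank-K truncated SVD
    approximation, as in the LSI setting.
    """
    if K is not None:
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
    return Q @ A
```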

The third file maps each query to its manually assigned relevant documents. This information was used as the reference for measuring the quality of the retrieval performance.

Recall and precision are the most commonly used metrics for measuring IR performance. Recall measures the proportion of relevant documents retrieved so far to all relevant documents in the collection, and precision measures the proportion of relevant documents retrieved to all documents retrieved so far. Recall alone is usually not indicative of retrieval performance since it is trivial to get perfect recall by retrieving all documents. As discussed by Kolda and O'Leary [5], pseudo-precision at a predefined recall level captures both the recall and precision concepts. We used a modified version of this metric known as average precision [15], a standard metric in IR research that measures the $I$-point interpolated average pseudo-precision over recall levels in $[0, 1]$. In the following, the definition and formulation of the metric are outlined; detailed discussions can be found in, e.g., refs. [5], [15], [16].

TABLE IX. THE TEXT COLLECTIONS.

            Medline  Cranfield     CISI     ADI
#Documents     1033       1398     1460      82
#Words        12011       6551     9080    1215
%NNZ         0.4567    0.85674  0.51701  2.1479
#Queries         30        225       35      35


TABLE X. AVERAGE PRECISION COMPARISON.

                   Medline    Cranfield    CISI         ADI
Vector space        0.4895       0.3537    0.1562       0.3185
Truncated SVD  0.4967(600)  0.3365(600)  0.1617(170)  0.2663(33)

Let $\mathbf{r} = \mathbf{q}^T \mathbf{A}$ be sorted in descending order. The precision at the $n$-th document is given by:

$$p_n = \frac{r_n}{n},$$

where $r_n$ denotes the number of relevant documents up to the $n$-th position. The pseudo-precision at recall level $x \in [0, 1]$ is defined as:

$$p(x) = \max\{p_n \mid x \leq r_n / r_N,\ n = 1, \ldots, N\},$$

where $r_N$ denotes the total number of relevant documents in the collection. The $I$-point interpolated average pseudo-precision over recall levels in $[0, 1]$ for a single query is defined as:

$$\frac{1}{I} \sum_{n=0}^{I-1} p\!\left(\frac{n}{I-1}\right),$$

where $n/(I-1)$, for $n = 0, \ldots, I-1$, are the $I$ equally spaced recall levels. We used the 11-point interpolated average precision as proposed in ref. [5] because three out of the four datasets used here are similar to those used in ref. [5]. However, due to differences in the preprocessing steps, our results are not similar to the results in ref. [5]. And because there are several queries in each text collection (shown in the last row of table IX), the average precision used in this work is the average value over all queries. So, for each text collection:

$$\mathit{average\ precision} \equiv \frac{1}{Q} \sum_{q=1}^{Q} \mathit{prec}_q,$$

where $\mathit{prec}_q$ denotes the 11-point interpolated average precision for the $q$-th query.
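For concreteness, the metric can be implemented as follows. This is a small sketch (not the authors' code) that takes one row of the score matrix and the set of relevant document indices for that query; the dataset-level average precision is then the mean of this quantity over all $Q$ queries.

```python
import numpy as np

def interpolated_avg_precision(scores, relevant, I=11):
    """I-point interpolated average pseudo-precision for one query.

    scores   : 1-D array of document scores (e.g., one row of Q @ A)
    relevant : set of indices of the documents judged relevant
    """
    order = np.argsort(-scores)                      # rank documents by score
    hits = np.cumsum([1 if d in relevant else 0 for d in order])
    n = np.arange(1, len(order) + 1)
    p = hits / n                                     # precision p_n = r_n / n
    recall = hits / max(len(relevant), 1)            # r_n / r_N
    levels = np.linspace(0.0, 1.0, I)                # recall levels n/(I-1)
    # pseudo-precision p(x) = max{ p_n | x <= r_n / r_N }
    px = [p[recall >= x].max() if np.any(recall >= x) else 0.0 for x in levels]
    return float(np.mean(px))
```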

Table X shows the average precision comparison between the vector space model and the truncated SVD, where the values displayed for the truncated SVD are in the format bestval(rank): bestval denotes the best value over all decomposition ranks (for Medline, Cranfield, and CISI the ranks were $K \in \{10, 20, \ldots, 600\}$, and for ADI the ranks were $K \in \{1, 2, \ldots, 40\}$) and rank is the rank at which bestval was attained. As shown, the truncated SVD offers no meaningful improvement over the vector space model on these datasets. And considering the computational cost and the problem of determining the optimum decomposition rank, the truncated SVD does not seem to be a promising technique for LSI. Note that our results can differ from other works, which usually show that the truncated SVD is a much better LSI method than the vector space model. But at least this work shows that the LSI capability of the truncated SVD is not as conclusive as previously reported.

VI. CONCLUSION

We have demonstrated the limitation of the truncated SVD for LSI, showing that its LSI capability is not as conclusive as previously reported. Depending on the structure of a dataset, the truncated SVD can fail to improve the retrieval performance of an IR system. Since real datasets may or may not have favourable structures, we suggest that the good retrieval performance of the truncated SVD reported in many works is due more to the preprocessing steps than to its LSI capability. And because computing the truncated SVD of a matrix is expensive, effort should be spent on refining the preprocessing steps before applying any sophisticated LSI method.

ACKNOWLEDGMENT

The author would like to thank the reviewers for useful comments. This research was supported by the Ministry of Higher Education of Malaysia and Universiti Teknologi Malaysia under Exploratory Research Grant Scheme R.J130000.7828.4L095.

REFERENCES

[1] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.

[2] M. Berry, S. Dumais, and G. O'Brien, "Using linear algebra for intelligent information retrieval," SIAM Rev., vol. 37, no. 4, pp. 573–595, 1995.

[3] O. Alhabashneh, R. Iqbal, N. Shah, S. Amin, and A. James, "Towards the development of an integrated framework for enhancing enterprise search using latent semantic indexing," in Proc. 19th Int'l Conf. on Conceptual Structures for Discovering Knowledge, 2011, pp. 346–352.

[4] S. P. Crain, K. Zhou, S.-H. Yang, and H. Zha, "Dimensionality reduction and topic modeling: From latent semantic indexing to latent Dirichlet allocation and beyond," in Mining Text Data, C. C. Aggarwal and C. Zhai, Eds. Springer, 2012, pp. 129–161.

[5] T. Kolda and D. O'Leary, "A semidiscrete matrix decomposition for latent semantic indexing information retrieval," ACM Trans. Inf. Syst., vol. 16, no. 4, pp. 322–346, 1998.

[6] A. Kontostathis and W. Pottenger, "A framework for understanding latent semantic indexing (LSI) performance," Information Processing and Management, vol. 42, no. 1, pp. 56–73, 2006.

[7] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent semantic indexing: A probabilistic analysis," J. Computer and System Sciences, vol. 61, no. 2, pp. 217–235, 2000.

[8] D. Thorleuchter and D. V. den Poel, "Improved multilevel security with latent semantic indexing," Expert Systems with Applications, vol. 39, no. 18, pp. 13462–13471, 2012.

[9] G. Salton and M. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[10] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, pp. 211–218, 1936.

[11] G. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix," J. SIAM Numerical Analysis, vol. 2, no. 2, pp. 205–224, 1965.

[12] G. Golub and C. van Loan, Matrix Computations (3rd ed.). Johns Hopkins University Press, 1996.

[13] I. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. 7th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, 2001, pp. 269–274.

[14] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, "Clustering large graphs via the singular value decomposition," Machine Learning, vol. 56, no. 1-3, pp. 9–33, 2004.

[15] D. Harman, Overview of the Third Text Retrieval Conference (TREC-3). National Institute of Standards and Technology, 1996.

[16] M. Berry and M. Browne, Understanding Search Engines: Mathematical Modelling and Text Retrieval, 2nd edition. SIAM, Philadelphia, 2005.
