Indexing by Latent Semantic Analysis Written by Deerwester, Dumais, Furnas, Landauer, and Harshman (1990) Reviewed by Cinthia Levy


Page 1:

Indexing by Latent Semantic Analysis

Written by Deerwester, Dumais, Furnas, Landauer, and Harshman

(1990)

Reviewed by Cinthia Levy

Page 2:

Latent Semantic Indexing

Term matching: most retrieval systems match words of a query (keywords) with words of a document.

Problem: what if users want to retrieve information based on conceptual content?

Page 3:

Latent Semantic Indexing

Expressing a concept in keywords is complicated and unreliable.

Synonymy: there are many ways to express the same concept. Results in poor recall.

Polysemy: most words have multiple meanings. Results in poor precision.

Page 4:

Latent Semantic Indexing

Three factors contribute to the failure of IR systems to overcome the problems associated with synonymy and polysemy:

1. Identification of index terms is incomplete
2. No automatic method adequately addresses polysemy
3. Technical: the way current IR systems work

Page 5:

Latent Semantic Indexing

Goal

...to build an IR system that predicts what terms “really” are implied by a query or what terms “really” apply to a document (i.e. the latent semantics).

Page 6:

Latent Semantic Indexing

Choosing a model

Proximity model:

similar items are put near each other in some space or structure.

Page 7:

Latent Semantic Indexing

Existing proximity models include:

Hierarchical, partition & overlapping clusterings

Ultrametric & additive trees

Factor-analytic & multidimensional distance models

Page 8:

Latent Semantic Indexing

An alternative model was considered, based on the following criteria:

1. Adjustable representational richness
2. Explicit representation of both terms and documents
3. Computational tractability for large datasets

Page 9:

Latent Semantic Indexing

Singular value decomposition (SVD), or two-mode factor analysis, satisfied all three criteria.

SVD: a fully automatic statistical method used to determine associations among terms in a large document collection, and to create a semantic or concept space.
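A minimal sketch of what SVD does to a term-document matrix, using NumPy and a toy count matrix (illustrative data, not the example from the paper):

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents);
# values are raw counts. Illustrative data only.
X = np.array([
    [1, 1, 0, 0],   # "human"
    [1, 0, 1, 0],   # "computer"
    [0, 1, 1, 0],   # "interface"
    [0, 0, 1, 1],   # "graph"
], dtype=float)

# SVD factors X = U @ diag(s) @ Vt: U holds term vectors, Vt document
# vectors, and s the singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The factorization reconstructs X up to floating-point error.
print(np.allclose(X, U @ np.diag(s) @ Vt))  # True
```

Truncating `U`, `s`, and `Vt` to the top k singular values yields the reduced "semantic space" the paper describes.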

Page 10:

Latent Semantic Indexing

Basis of LSI:

Documents are condensed to contain only “content words” with semantic meaning

Patterns of word distribution (co-occurrence) are analyzed across a collection of documents.

Page 11:

Latent Semantic Indexing

Basis of LSI:

Document collection is examined as a whole

Documents with many words in common are semantically close.

Documents with few words in common are semantically distant.
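The "words in common" intuition can be made concrete as cosine similarity between word-count vectors (a standard measure; the slide itself does not name one, and the data below is illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vocabulary: [system, user, response, tree, graph]
doc1 = np.array([2, 1, 1, 0, 0], dtype=float)  # shares words with doc2
doc2 = np.array([1, 2, 0, 0, 0], dtype=float)
doc3 = np.array([0, 0, 0, 2, 1], dtype=float)  # no words in common with doc1

print(cosine(doc1, doc2) > cosine(doc1, doc3))  # True: doc1 is closer to doc2
```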

Page 12:

Latent Semantic Indexing

Steps of LSI:

Format document: punctuation removed, no capitalization.

Select content words: words with no semantic value are removed using stop list.

Apply Stemming*: reduces words to root form.

*(not applied in Deerwester, et al.)
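The three steps above can be sketched as follows; the stop list and suffix-stripping rule are tiny hypothetical stand-ins for a real stop list and stemmer, and stemming is off by default since Deerwester et al. did not apply it:

```python
import re

# Tiny illustrative stop list (a real system would use a much larger one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def suffix_stem(word):
    """Crude suffix stripping as a stand-in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def content_words(text, stem=False):
    # Step 1 - format: lowercase and strip punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 2 - select content words: drop stop-list entries.
    words = [t for t in tokens if t not in STOP_WORDS]
    # Step 3 - optional stemming (not applied in Deerwester et al.).
    return [suffix_stem(w) for w in words] if stem else words

print(content_words("The indexing of documents!"))  # ['indexing', 'documents']
```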

Page 13:

Latent Semantic Indexing

Result: List of content words

The list of content words is used to generate a term-document matrix.
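A minimal way to turn per-document content-word lists into a term-document matrix (rows = terms, columns = documents; the word lists are illustrative):

```python
import numpy as np

docs = [  # one content-word list per document (illustrative)
    ["human", "interface", "computer"],
    ["survey", "computer", "system"],
    ["graph", "tree", "graph"],
]

# Rows are the sorted vocabulary; columns are documents.
terms = sorted({w for d in docs for w in d})
X = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc:
        X[terms.index(word), j] += 1  # raw count of term i in document j

print(terms)
print(X)
```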

Page 14:

Latent Semantic Indexing

Term-document matrix

Page 15:

Latent Semantic Indexing

Term-document matrix:

• Term weighting* is applied to each value
• SVD algorithm is applied to the matrix
• Matrix represents vectors in a multi-dimensional space

*(not applied in Deerwester, et al.)
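The bullet steps can be sketched end to end: truncate the SVD to k dimensions, place the documents in the latent space, and fold a query into the same space for cosine-similarity ranking. A minimal sketch on toy counts (term weighting omitted, as in Deerwester et al.; the fold-in formula q·U_k·S_k⁻¹ is the standard LSI one):

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

k = 2  # number of retained latent dimensions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Document coordinates in the k-dimensional latent space.
doc_vecs = (np.diag(sk) @ Vtk).T

def query_vec(q):
    # Fold a query (term-count vector) into the latent space.
    return q @ Uk @ np.diag(1.0 / sk)

def rank(q):
    qv = query_vec(q)
    sims = doc_vecs @ qv / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(qv))
    return np.argsort(-sims)  # document indices by descending cosine similarity

q = np.array([1, 1, 0, 0, 0], dtype=float)  # query mentioning terms 0 and 1
print(rank(q))
```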

Page 16:

Latent Semantic Indexing

Visual representation of a three-dimensional space: content words form three orthogonal axes (mutually perpendicular).

[Figure: axes labeled “eggs”, “bacon”, “coffee”]

Page 17:

Latent Semantic Indexing

“If you draw a line from the origin of the graph to each of these points, you obtain a set of vectors in 'bacon-eggs-and-coffee' space. The size and direction of each vector tells you how many of the three key items were in any particular order, and the set of all the vectors taken together tells you something about the kind of breakfast people favor on a Saturday morning.”

Retrieved from:

http://javelina.cet.middlebury.edu/lsa/out/lsa_explanation.htm

Page 18:

Latent Semantic Indexing

Retrieved from: http://lsi.research.telcordia.com/lsi-bin/lsiQuery

Page 19:

Latent Semantic Indexing

Romans 1:22 Professing themselves to be wise, they became fools…

Romans 16:6 Greet Mary, who bestowed much labour on us.

Matthew 24:22 And except those days should be shortened, there should no flesh be saved: but for the elect's sake those days shall be shortened.

John 3:17 For God sent not his Son into the world to condemn the world; but that the world through him might be saved.

Page 20:

Latent Semantic Indexing

(Deerwester…)

System compared to:
• Straight term matching
• Voorhees
• SMART

Using:
1. a collection of medical abstracts (MED)
2. information science abstracts (CISI)

Page 21:

Latent Semantic Indexing

Summary of analyses

• LSI performed better than or equal to simple term matching

• LSI was shown to be superior to system described by Voorhees

• LSI performed better than or equal to SMART

Page 22:

Latent Semantic Indexing

Conclusion

LSI represents both terms and documents in the same space, which supports the retrieval of relevant information.

LSI does not rely on literal term matching and thus retrieves more relevant information than other methods.

LSI offers an adequate solution to the problem of synonymy but only a partial solution to the problem of polysemy.