CS 430: Information Discovery
Lecture 12
Latent Semantic Indexing
Course Administration
Thursday, October 16: Presidential Inauguration
Narayana N. R. Murthy, Chairman and Chief Mentor Officer
Infosys Technologies Limited

Cornell - The Unfinished Agenda: The Musings of a Corporate Person

Biotech Building, Large Conference Room, 10:00 to 11:00 a.m.
Course Administration
Assignment 1
You should receive answers to questions by email
Newsgroup
If you have a general question about the assignment, send it to the newsgroup:
News server: newsstand.cit.cornell.edu
Newsgroup: cornell.class.cs430
Course Administration
Midterm Examination
A sample examination and discussion of the solution will be posted to the Web site.
Assignment 2
It is not the job of the Teaching Assistants to do the assignments for you. They may answer your question, "This is a matter for you to judge," or "This was covered in the lectures." Use the report to explain your choices.
We will post answers to general questions on the news group.
Probabilistic Principle
Given a query q and a document dj, the model needs an estimate of the probability that the user finds dj relevant, i.e., P(R | dj). Here R is the set of relevant documents and R̄ the set of non-relevant documents.

similarity (dj, q) = P(R | dj) / P(R̄ | dj)

                   = [P(dj | R) P(R)] / [P(dj | R̄) P(R̄)]    by Bayes Theorem

                   = [P(dj | R) / P(dj | R̄)] x k    where k = P(R) / P(R̄) is constant

P(dj | R) is the probability of randomly selecting dj from R.
Binary Independence Retrieval Model (BIR)
Let x = (x1, x2, ... xn) be the term incidence vector for dj. xi = 1 if term i is in the document and 0 otherwise.
We estimate P(dj | R) by P(x | R)
If the index terms are independent
P(x | R) = P(x1 | R) P(x2 | R) ... P(xn | R) = ∏ P(xi | R)
Binary Independence Retrieval Model (BIR)
S = similarity (dj, q) = k ∏i P(xi | R) / ∏i P(xi | R̄)

Since each xi is either 0 or 1, this can be written:

S = k ∏_{xi = 1} [P(xi = 1 | R) / P(xi = 1 | R̄)] ∏_{xi = 0} [P(xi = 0 | R) / P(xi = 0 | R̄)]
Binary Independence Retrieval Model (BIR)
For terms that appear in the query, let

pi = P(xi = 1 | R)
ri = P(xi = 1 | R̄)

For terms that do not appear in the query, assume

pi = ri

Then

S = k ∏_{xi = qi = 1} (pi / ri) ∏_{xi = 0, qi = 1} [(1 - pi) / (1 - ri)]

  = k ∏_{xi = qi = 1} [pi (1 - ri) / (ri (1 - pi))] ∏_{qi = 1} [(1 - pi) / (1 - ri)]

The second product is constant for a given query.
Binary Independence Retrieval Model (BIR)
Taking logs and ignoring factors that are constant for a given query, we have:
similarity (d, q) = ∑ log [pi (1 - ri) / ((1 - pi) ri)]

where the summation is taken over those terms that appear in both the query and the document.

This similarity measure can be used to rank all documents against the query q.
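As a minimal sketch (not from the lecture), the ranking formula above can be implemented directly. The dictionaries `p` and `r` holding the estimates pi and ri, and the toy term sets, are assumptions for illustration:

```python
import math

def bir_similarity(doc_terms, query_terms, p, r):
    """Binary Independence Retrieval score: sum of log odds over
    terms that appear in both the query and the document.
    p[t] = P(x_t = 1 | R), r[t] = P(x_t = 1 | R-bar)."""
    score = 0.0
    for t in query_terms & doc_terms:
        score += math.log((p[t] * (1 - r[t])) / ((1 - p[t]) * r[t]))
    return score

# Toy example with assumed estimates
p = {"retrieval": 0.5, "indexing": 0.5}
r = {"retrieval": 0.1, "indexing": 0.3}
q = {"retrieval", "indexing"}
print(bir_similarity({"retrieval", "model"}, q, p, r))
```

Documents are then sorted by this score; terms outside the query contribute nothing, matching the derivation above.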
Estimates of P(xi | R)
Initial guess, with no information to work from:
pi = P(xi | R) = c
ri = P(xi | R̄) = ni / N

where:

c is an arbitrary constant, e.g., 0.5
ni is the number of documents that contain xi
N is the total number of documents in the collection
Improving the Estimates of P(xi | R)
Human feedback -- relevance feedback (discussed later)
Automatically
(a) Run query q using initial values. Consider the t top ranked documents. Let si be the number of these documents that contain the term xi.
(b) The new estimates are:
pi = P(xi | R) = si / t
ri = P(xi | R̄) = (ni - si) / (N - t)
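The two estimation steps can be sketched as follows (a sketch, not the lecture's code; the toy collection and counts are assumptions, and in practice a small smoothing constant is usually added so the log odds stay finite):

```python
def initial_estimates(term_doc_count, N, c=0.5):
    """Initial guesses with no information: p_i = c, r_i = n_i / N."""
    p = {t: c for t in term_doc_count}
    r = {t: n / N for t, n in term_doc_count.items()}
    return p, r

def update_estimates(term_doc_count, N, top_docs, t_top):
    """Re-estimate from the t_top top-ranked documents:
    p_i = s_i / t_top, r_i = (n_i - s_i) / (N - t_top),
    where s_i counts top documents containing term i."""
    p, r = {}, {}
    for term, n in term_doc_count.items():
        s = sum(1 for d in top_docs if term in d)
        p[term] = s / t_top
        r[term] = (n - s) / (N - t_top)
    return p, r

# Assumed toy collection of N = 10 documents: n_i for two terms
counts = {"trees": 4, "graph": 3}
p, r = initial_estimates(counts, N=10)
top = [{"trees"}, {"trees", "graph"}]  # the 2 top-ranked documents
p2, r2 = update_estimates(counts, 10, top, t_top=2)
```

Each iteration reranks with the new estimates; the process can be repeated until the ranking stabilizes.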
Discussion of Probabilistic Model
Advantages
• Based on a firm theoretical foundation

Disadvantages

• Initial values have to be guessed
• Weights ignore term frequency
• Assumes independent index terms
Latent Semantic Indexing
Objective
Replace indexes that use sets of index terms by indexes that use concepts.
Approach
Map the index term vector space into a lower dimensional space, using singular value decomposition.
Deficiencies with Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).

Polysemy: Individual words have more than one meaning (lowers precision).

Independence: No significance is given to two terms that frequently appear together.
Example
Query: "IDF in computer-based information look-up"
Index terms for a document: access, document, retrieval, indexing
How can we recognize that information look-up is related to retrieval and indexing?
Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?
Technical Memo Example: Titles
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
Technical Memo Example: Terms and Documents
Terms       Documents
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
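The term-document matrix above can be written down as a numpy array (a sketch; the row and column orderings follow the table):

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# X[i, j] = number of times term i occurs in document j
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
])
print(X.shape)  # (12, 9): t x d
```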
Technical Memo Example: Query
Query:
Find documents relevant to "human computer interaction"
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
The index term vector space

[Figure: document vectors d1 and d2 plotted against term axes t1, t2, t3.]

The space has as many dimensions as there are terms in the word list.
Models of Semantic Similarity
Proximity models: Put similar items together in some space or structure
• Clustering (hierarchical, partition, overlapping). Documents are considered close to the extent that they contain the same terms. Most then arrange the documents into a hierarchy based on distances between documents. [Covered later in course.]
• Factor analysis based on matrix of similarities between documents (single mode).
• Two-mode proximity methods. Start with rectangular matrix and construct explicit representations of both row and column objects.
Selection of Two-mode Factor Analysis
Additional criterion:
Computationally efficient: O(N^2 k^3)

N is the number of terms plus documents
k is the number of dimensions
Figure 1

[Figure: terms (•), documents, and a query plotted together; dashed lines (---) indicate pairs with cosine > 0.9.]
Mathematical concepts
Singular Value Decomposition
Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).
There exist matrices T0, S0 and D0 such that:

X = T0 S0 D0'

T0 and D0 are the matrices of left and right singular vectors.
T0 and D0 have orthonormal columns.
S0 is the diagonal matrix of singular values
Dimensions of matrices
X = T0 S0 D0'

where X is t x d, T0 is t x m, S0 is m x m, and D0' is m x d.

m is the rank of X; m ≤ min(t, d)
Reduced Rank
Diagonal elements of S0 are positive and decreasing in magnitude. Keep the first k and set the others to zero.
Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives:
X ≈ X̂ = TSD'

Interpretation

If the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy, and recognizes dependence.
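The reduced-rank step can be sketched with numpy (a sketch, not the lecture's code; the toy matrix and the choice k = 2 are assumptions):

```python
import numpy as np

# Toy 4-term x 3-document matrix (assumed for illustration)
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# Full SVD: X = T0 S0 D0'
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values; drop the rest together with
# the corresponding columns of T0 and rows of D0'
k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]

X_hat = T @ S @ Dt  # rank-k approximation of X
print(np.linalg.norm(X - X_hat))  # approximation error
```

`numpy.linalg.svd` returns the singular values in decreasing order, so taking the first k slices off exactly the values the slide says to keep.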
Selection of singular values
X̂ = T S D'

where X̂ is t x d, T is t x k, S is k x k, and D' is k x d.

k is the number of singular values chosen to represent the concepts in the set of documents.

Usually, k « m.
Comparing Two Terms
X̂X̂' = TSD'(TSD')'
      = TSD'DS'T'
      = TSS'T'          since D is orthonormal
      = TS(TS)'

To calculate the i, j cell, take the dot product between rows i and j of TS.

Since S is diagonal, TS differs from T only by a stretching of the coordinate system.

The dot product of two rows of X̂ reflects the extent to which two terms have a similar pattern of occurrences.
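The identity can be checked numerically (a sketch on an assumed toy matrix, again with k = 2):

```python
import numpy as np

# Assumed toy 4-term x 4-document matrix
X = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

X_hat = T @ S @ D.T
TS = T @ S  # one row per term, in concept space

# Term-term similarities: X_hat X_hat' equals TS (TS)'
left = X_hat @ X_hat.T
right = TS @ TS.T
print(np.allclose(left, right))  # True
```

So all term-term comparisons need only the small t x k matrix TS, not the full X̂.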
Comparing Two Documents
X̂'X̂ = (TSD')'TSD'
      = DS(DS)'          since T is orthonormal

To calculate the i, j cell, take the dot product between rows i and j of DS.

Since S is diagonal, DS differs from D only by a stretching of the coordinate system.

The dot product of two columns of X̂ reflects the extent to which two documents have a similar pattern of term occurrences.
Comparing a Term and a Document
Comparison between a term and a document is the value of an individual cell of X̂.

X̂ = TSD'
   = (TS^1/2)(DS^1/2)'

where S^1/2 is a diagonal matrix whose values are the square roots of the corresponding elements of S.
Technical Memo Example: Query
Query: "human system interactions on trees"

Terms       Query xq
human        1
interface    0
computer     0
user         0
system       1
response     0
time         0
EPS          0
survey       0
trees        1
graph        0
minors       0

In term-document space, a query is represented by xq, a t x 1 vector.

In concept space, a query is represented by dq, a 1 x k vector.
Query
The suggested form of dq is:

dq = xq'TS^-1

Example of use. To compare a query against document i, take the ith element of the product of DS and (dqS)', which is the ith element of the product of DS and (xq'T)'.

Note that dq is a row vector.
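Folding a query into concept space can be sketched as follows (toy matrix and query assumed; note dq S = xq'T because the S^-1 cancels):

```python
import numpy as np

# Assumed toy 4-term x 4-document matrix
X = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

xq = np.array([1., 0., 1., 0.])  # assumed query, as a t-vector

# Fold the query into concept space: dq = xq' T S^-1
dq = xq @ T @ np.linalg.inv(S)

# Compare against every document at once: DS (dq S)'
scores = (D @ S) @ (dq @ S)
print(scores)  # one similarity value per document
```

Ranking the documents by `scores` gives the LSI retrieval order for the query.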
Experimental Results
Deerwester et al. tried latent semantic indexing on two test collections, MED and CISI, for which queries and relevance judgments were available.

Documents were the full text of title and abstract.

Stop list of 439 words (SMART); no stemming, etc.

Comparison with: (a) simple term matching, (b) SMART, (c) the Voorhees method.
Experimental Results: 100 Factors
Experimental Results: Number of Factors