IR Theory: IR Basics & Vector Space Model
IR Approach
Search Engine 2
Information Seeker Authors
Information Need Concepts
Query String Document Text
Is the document relevant to the query?
IR System Architecture
Search Engine 3
Documents Query
Results
Representation Module
Representation Module
Matching/Ranking Module
Document Representation
Query Representation
Step 1: Representation
Search Engine 4
Documents Query
Results
Representation Module
Representation Module
Matching/Ranking Module
Document Representation
Query Representation
How to represent text? How do we represent the complexities of language?
► Computers don’t “understand” documents or queries Simple, yet effective approach: “bag of words”
► Treat all the words in a document as index terms for that document ► Disregard order, structure, meaning, etc. of the words
Search Engine 5
McDonald's slims down spuds Fast-food chain to reduce certain types of fat in its french fries with new cooking oil. NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
16 × said 14 × McDonalds 12 × fat 11 × fries 8 × new 6 × company french nutrition 5 × food oil percent reduce
taste Tuesday …
Bag of Words
Bag-of-Word Representation
Search Engine 6
The quick brown fox jumped over the lazy dog’s back.
Document 1
Document 2
Now is the time for all good men to come to the aid of their party.
the
is for
to
of
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 0
1 1 0 0 1 0 0 1 0 0 1 1 0 1 0 1 1
Term D
ocum
ent 1
Doc
umen
t 2
Stopword List
Step 2: Term Weighting
Search Engine 7
Documents Query
Results
Representation Module
Representation Module
Matching/Ranking Module
Document Representation
Query Representation
Term Weight: What & How? What is term weight?
► Numerical estimate of term importance How should we estimate the term importance?
► Terms that appear often in a document should get high weights • The more often a document contains the term “dog”, the more likely that the
document is “about” dogs.
► Terms that appear in many documents should get low weights • Words like “the”, “a”, “of” appear in (nearly) all documents.
► Term frequency in long documents should count less than those in short ones How do we compute it?
► Term frequency (tf) ► Inverse document frequency (idf) ► Document length (dl)
Search Engine 8
Step 3: Matching/Ranking
Search Engine 9
Documents Query
Results
Representation Module
Representation Module
Matching/Ranking Module
Document Representation
Query Representation
Boolean vs. Vector Space Model Boolean model
► Based on the notion of sets • Does not impose a ranking on retrieved documents
► Documents are retrieved only if they satisfy Boolean conditions specified in the query
• Exact match
Vector space model ► Based on geometry, the notion of vectors in high dimensional space ► Documents are ranked based on their similarity to the query ► Best/partial match
Search Engine 10
Boolean Model: Overview Weights assigned to terms are either “0” or “1”
► “0” represents “absence”: term isn’t in the document ► “1” represents “presence”: term is in the document
Build queries by combining terms with Boolean operators ► AND, OR, NOT
The system returns all documents that satisfy the query
Search Engine 11
A B
A OR B
A AND B
A AND NOT(B)
Boolean Model: Strength Boolean operators define the relationship between query terms.
► AND → terms/concepts that are not equivalent/similar • party AND good: good party • Retrieves records that include all AND terms → Narrows the search
► OR → related terms, synonyms • party AND (good OR excellent OR wild): good party, excellent party, wild party • Retrieves records that include any OR terms → Broadens the search
► NOT → antonyms, alternate terms for polysemes • party NOT democratic: Democratic party • Eliminates records that include NOT term → Narrows the search
Precise, if you know the right strategies ► knows what concepts to combine/exclude, narrow/broaden
Efficient for the computer
Search Engine 12
Boolean Model: Weakness Natural language is way more complex
► Boolean logic insufficient to capture the richness of language ► AND “discovers” nonexistent relationships
• Terms in different sentences, paragraphs, … Money is good, but I won’t be party to stealing.
► Guessing terminology for OR is hard • good, nice, excellent, outstanding, awesome, …
► Guessing terms to exclude is even harder! • Democratic party, party to a lawsuit, …
No control over size of result set Too many documents or none All documents in the result set are considered “equally good”
No Partial Matching Documents that “don’t quite match” the query may also be useful.
Search Engine 13
Vector Space Model: Pros & Cons
Pros ► Non-binary term weights ► Partial matching ► Ranked results ► Easy query formulation ► Query expansion
Cons ► Term relationships ignored ► Term order ignored ► No wildcard ► Problematic w/ long
documents
► Similarity ≠ Relevance
Search Engine 14
Vector Space Model: Representation “Bags of words” can be represented as vectors
► Computational efficiency ► Ease of manipulation ► Geometric metaphor: “arrows”
A vector is a set of values recorded in any consistent order
Search Engine 15
“The quick brown fox jumped over the lazy dog’s back”
→ (1, 1, 1, 1, 1, 1, 1, 1, 2)
1st position corresponds to “back” 2nd position corresponds to “brown” 3rd position corresponds to “dog” 4th position corresponds to “fox” 5th position corresponds to “jump” 6th position corresponds to “lazy” 7th position corresponds to “over” 8th position corresponds to “quick” 9th position corresponds to “the”
back 1
brown 1
dog 1
fox 1
jump 1
lazy 1
over 1
quick 1
the 2
Bag of words Vector
Vector Space Model: Ranked Retrieval Order documents by “relevance”
► Relevance = how likely they are to be relevant to the information need ► Some documents are “better” than others ► Users can decide when to stop reading
Best (partial) match ► Documents need not have all query terms ► Documents with more query terms should be “better”
Estimate relevance with query-document similarity 1. Treat the query as if it were a document
• Create a query bag-of-words • Compute term weights
2. Find its similarity to each document 3. Rank order the documents by similarity
• Works surprisingly well
Search Engine 16
Vector Space Model: 3-D Example
Search Engine 17
y
x
z
A vector A in a 3-dimensional space • Represented with initial point at the origin of a rectangular coordinate system. • Projections of A on the x, y, and z axes: Ax, Ay, and Az
− the (rectangular) components of A in the x, y, and z directions − each axis represents a term (e.g., x = all, y = brown, z = cat)
A
Ax
Ay
Az
Vector Space Model: Postulate
Search Engine 18
Documents that are “close together” in vector space “talk about” the same things
t2
d2
d1 d3
d4
d5
t3
t1
θ
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)
Vector Space Model: Example
Search Engine 19
t2
d3 d1
d2
q
t3
t1
Query: What is information retrieval? Q: Information 1, retrieval 1
Index Term d1 d2 d3
t1 (information) 1 1 1
t2 (retrieval) 1 2 0
t3 (seminar) 1 1 1
D1: Information retrieval seminars D2: Retrieval seminars and Information Retrieval D3: Information seminar
d3 q
√𝟐 √𝟐
√𝟐
q d1
√𝟐 √𝟑
𝟏
q
d2
√𝟐
√𝟔
√𝟐
θ = 60
θ ≈ 36
θ ≈ 29
Similarity Measures: Set-based Simple matching function
Dice’s coefficient
Jaccard’s coefficient
A = (wd1, wd2, wd3, wd4, wd5) B = (wd2, wd4, wd6) A ∩ B: intersection of A and B
• the set of elements that belongs to both A and B • A ∩ B = (wd2, wd4)
A ∪ B: union of A and B • the set of elements that belongs to either A or B. • A ∪ B = (wd1, wd2, wd3, wd4, wd5, wd6)
|A| : cardinality of A • the number of elements in A • |A| = 5 • |B| = 3 • |A ∩ B| = 2 • |A ∪ B| = 6
Similarity Scores • Simple: |A ∩ B| = 2 • Dice: 2* |A ∩ B| / (|A|+ |B|) = 2*2 /8 = 1/2 • Jaccard: |A ∩ B| / |A ∪ B| = 2/6 = 1/3
Search Engine 20
BABASIM ∩=),(
BABA
BASIM+∩
=2
),(
BABA
BASIM∪∩
=),(
Similarity Measures: Set-based Example
Search Engine 21
Simple matching function
By Dice’s coefficient
By Jaccard’s coefficient
Object – Attribute (feature) array O1 = (1, 0, 1, 1, 0, 0, 0, 1) |O1| = 4 O2 = (1, 0, 0, 0, 1, 1, 0, 0) |O2| = 3 O3 = (1, 0, 0, 1, 1, 1, 0, 0) |O3| = 4 O4 = (1, 1, 0, 1, 0, 1, 1, 0) |O4| = 5 O5 = (1, 1, 1, 1, 0, 0, 1, 1) |O5| = 6 |O1∩ O2| = |(A1)| = 1 |O1∩ O3| = |(A1, A4)| = 2 |O1∩ O4| = |(A1, A4)| = 2 |O1∩ O5| = |(A1, A3, A4 , A8)| = 4 |O1∪ O2| = |(A1, A3, A4, A5, A6, A8)| = 6 |O1∪ O3| = |(A1, A3, A4, A5, A6, A8)| = 6 |O1∪ O4| = |(A1, A2, A3, A4, A6, A7, A8)| = 7 |O1∪ O5| = |(A1, A2, A3, A4, A7, A8)| = 6
BABASIM ∩=),(
BABA
BASIM+∩
=2
),(
BABA
BASIM∪∩
=),(
O2 O3 O4 O5
SIM 1 2 2 4
Rank 4 2 2 1
O2 O3 O4 O5
SIM 2*1/(4+3)=2/7 2*2/(4+4)=4/8 2*2/(4+5)=4/9 2*4/(4+6)=8/10
Rank 4 2 3 1
O2 O3 O4 O5
SIM 1/6 2/6 2/7 4/6
Rank 4 2 3 1
Similarity Measures: Vector-based Cosine Similarity (n-dimensional space)
Dot/Scalar product of vectors / product of vector lengths
• Dot product = sum (product of each axis component) A = (A1, A2, A3, A4) B = (B1, B2, B3, B4)
A•B = (A1B1+A2B2+A3B3+A4B4)
• Vector length = square root of sum (square of each axis component) |A| = sqrt [(A1)2+ (A2)2+ (A3)2+ (A4)2]
|B| = sqrt [(B1)2+ (B2)2+ (B3)2+ (B4)2]
Cosine Similarity (3-dimensional space) A = (Ax, Ay, Az) B = (Bx, By, Bz)
Search Engine 22
222222 )()()()()()(cos
zyxzyx
zzyyxx
BBBAAA
BABABA
++++
++=θ
∑∑
∑
==
==•
=n
ii
n
ii
n
iii
BA
BA
BABA
1
2
1
2
1
)()(cosθ
Similarity Measures: Vector-based Example
Search Engine 23
Compute cosine similarities
Rank objects O2 through O5 by descending order of similarity to O1
Object – Attribute (feature) array O1 = (1, 0, 1, 1, 0, 0, 0, 1) O2 = (1, 0, 0, 0, 1, 1, 0, 0) O3 = (1, 0, 0, 1, 1, 1, 0, 0) O4 = (1, 1, 0, 1, 0, 1, 1, 0) O5 = (1, 1, 1, 1, 0, 0, 1, 1) |O1| = sqrt(12+02+12+12+02+02+02+12) = sqrt(4) |O2| = sqrt(12+02+02+02+12+12+02+02) = sqrt(3) |O3| = sqrt(12+02+02+12+12+12+02+02) = sqrt(4) |O4| = sqrt(12+12+02+12+02+12+12+02) = sqrt(5) |O5| = sqrt(12+12+12+12+02+02+12+12) = sqrt(6) O1•O2 = (1*1+0*0+1*0+1*0+0*1+0*1+0*0+1*0) = 1 O1•O3 = (1*1+0*0+1*0+1*1+0*1+0*1+0*0+1*0) = 2 O1•O4 = (1*1+0*1+1*0+1*1+0*0+0*1+0*1+1*0) = 2 O1•O5 = (1*1+0*1+1*1+1*1+0*0+0*0+0*1+1*1) = 4
121
341),( 21 ==OOSIM
162
442),( 31 ==OOSIM
202
542),( 41 ==OOSIM
244
644),( 51 ==OOSIM ∑∑
∑
==
==•
=n
ii
n
ii
n
iii
BA
BA
BABA
1
2
1
2
1
)()(cosθ
Text Analysis: Word Frequency TREC Volume 3 Corpus
► Number of documents: 336,310 ► Total word occurrences: 125,720,891 ► Unique words: 508,209
Zipf Distribution ► Rank*Frequency = constant
• Population, Wealth, Popularity
► A few words are very common ► Most words are very rare
Term Weights ► Represents the ability of terms to identify relevant items
& to distinguish them from non-relevant material ► Very common & very rare words are not
very useful for indexing (Luhn, 1958) ► Good
• Smaller index → Faster retrieval
► Bad • Lost gems & broken phrases
Search Engine 24
B. Croft (Umass)
Text Analysis: Term Weighting Term Weighting Factors
► Term frequency (tf) • Number of times that a term occurs in a given document
tf(dogd1) = 2, tf(dogd2) = 1 tf(foxd1) = 3, tf(foxd2) = 0 tf(partyd1) = 0, tf(partyd2) = 1
► Inverse document frequency (idf)
• (Simple) 1/number of document in which a term occurs idf(dog) = 1/2, idf(fox) = 1/1, idf(party) = 1/1
• (Default) log(Nd/ number of document in which a term occurs) Nd = number of document in a collection idf(dog)=log(2/2)=0, idf(fox)=log(2/1)=0.3, idf(party) = log(2/1)=0.3
► Document length (dlen) • Number of tokens in a document
Token = an instance/occurrence of a word (not unique word) dlen(d1) = 11, dlen(d2) = 10
tf⋅idf formula
Search Engine 25
quick
brown
fox
over
lazy
dog
back
now
time
all
good
men
come
jump
aid
their
party
0 0 1 1 0 2 3 0 1 1 0 0 1 0 1 0 0
1 1 0 0 1 1 0 1 0 0 1 1 0 1 0 1 1
Term d1 d2
k
dkiki d
Nfidftfw log=∗=wki = weight of term k in document i fki = frequency of term k in document i (tf) Nd = number of documents in collection dk = number of documents in which term k appears (postings)
Similarity Measures: using Term Weights
Search Engine 26
1. Compute term weights (e.g., tf*idf)
• Nd = 5 • d1=5, d2=4, d3=1, d4=3, d5=2, d6=3, d7=4, d8=2
Document – Term array
k
dkiki d
Nfidftfw log=∗=
Similarity Measures: using Term Weights
Search Engine 27
2. Compute query-document cosine similarity with tf*idf weights
∑∑
∑
==
==•
=n
ii
n
ii
n
iii
BA
BA
BABA
1
2
1
2
1
)()(cosθ