IR Theory - KNU
TRANSCRIPT
IR Theory: IR Basics & Overview of IR Models
IR Approach
Search Engine 2
[Diagram: an information seeker with an information need issues a query string; authors express concepts as documents. The IR system asks: is the document relevant to the query?]
IR System Architecture
[Diagram: a query and the documents each pass through a Representation Module, yielding a query representation and document representations; the Matching/Ranking Module compares them and returns results.]
IR Step 1: Representation
How to represent text? How do we represent the complexities of language?
Computers don’t “understand” documents or queries
Simple, yet effective approach: the “bag of words”. Treat all the words in a document as index terms for that document; disregard order, structure, meaning, etc. of the words.
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
Term counts: 16 × said; 14 × McDonalds; 12 × fat; 11 × fries; 8 × new; 6 × company, french, nutrition; 5 × food, oil, percent, reduce, taste, Tuesday; …
Bag of Words
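A minimal sketch of the bag-of-words idea in Python. The snippet below is an abbreviated paraphrase of the article above, used only for illustration:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase, split into word tokens, count occurrences;
    order, structure, and meaning are discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Abbreviated snippet of the article above, for illustration only.
snippet = ("McDonald's is cutting the amount of bad fat in its french "
           "fries, the fast-food chain said Tuesday. The company says "
           "the fries won't taste different.")
bow = bag_of_words(snippet)
```

As in the article's count list, function words like "the" dominate raw counts, which motivates the stopword removal and term weighting discussed next.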
Bag-of-Word Representation
Search Engine 6
Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

Stopword list: the, is, for, to, of

Term    Document 1    Document 2
aid     0             1
all     0             1
back    1             0
brown   1             0
come    0             1
dog     1             0
fox     1             0
good    0             1
jump    1             0
lazy    1             0
men     0             1
now     0             1
over    1             0
party   0             1
quick   1             0
their   0             1
time    0             1

Document 1 → 00110110110010100
Document 2 → 11001001001101011
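The binary (presence/absence) representation above can be sketched as follows. Note the slide also stems “jumped” to “jump”; that does not change the term’s alphabetical position here, so the bit strings come out the same without a stemmer:

```python
import re

# Drop stopwords, sort the remaining vocabulary alphabetically,
# and record presence (1) / absence (0) per document.
STOPWORDS = {"the", "is", "for", "to", "of"}

def tokens(text):
    words = re.findall(r"[a-z']+", text.lower())
    # crude normalization: strip the possessive ("dog's" -> "dog")
    return [w.removesuffix("'s") for w in words if w not in STOPWORDS]

d1 = "The quick brown fox jumped over the lazy dog's back."
d2 = "Now is the time for all good men to come to the aid of their party."

t1, t2 = set(tokens(d1)), set(tokens(d2))
vocab = sorted(t1 | t2)
vec1 = "".join("1" if t in t1 else "0" for t in vocab)
vec2 = "".join("1" if t in t2 else "0" for t in vocab)
```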
IR Step 2: Term Weighting
Term Weight: What & How?
What is term weight? A numerical estimate of a term’s importance.

How should we estimate term importance?
• Terms that appear often in a document should get high weights: the more often a document contains the term “dog”, the more likely the document is “about” dogs.
• Terms that appear in many documents should get low weights: words like “the”, “a”, “of” appear in (nearly) all documents.
• Term frequency in long documents should count less than in short ones.

How do we compute it? From term frequency (tf), inverse document frequency (idf), and document length (dl).
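These three ingredients can be combined in many ways. One simple, commonly used variant (a sketch, not necessarily the lecture’s exact formula) multiplies length-normalized tf by idf:

```python
import math
from collections import Counter

# Toy collection: each document is a list of tokens.
docs = [
    "the dog chased the cat".split(),
    "the dog barked".split(),
    "the cat slept".split(),
]
N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(t for d in docs for t in set(d))

def weight(term, doc):
    """tf-idf with document-length normalization:
    high if the term is frequent in the document (tf),
    low if it appears in many documents (idf),
    discounted in long documents (dl)."""
    tf = doc.count(term)          # term frequency
    dl = len(doc)                 # document length
    idf = math.log(N / df[term])  # inverse document frequency
    return (tf / dl) * idf
```

Here `weight("the", docs[0])` is 0 because “the” appears in every document (idf = log 1 = 0), while a rare term like “chased” gets the highest weight.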
IR Step 3: Matching & Ranking
IR Models
Boolean Model ← Boolean Logic + Set Theory
• Query: logical expression of terms (e.g., a AND b); Document: a set of terms
• Search Result → a set of documents satisfying the query expression
Vector Space Model
• Document & query as vectors of terms
• Search Result → documents ranked by query-document similarity
Probabilistic Model
• In practice, similar to VSM except using probabilistic term weights
• Search Result → documents ranked by probability of relevance
Boolean Model: Overview
Weights assigned to terms are either “0” or “1”: “0” represents absence (the term isn’t in the document); “1” represents presence (the term is in the document).
Build queries by combining terms with Boolean operators: AND, OR, NOT.
The system returns all documents that satisfy the query.
[Venn diagrams: two sets A and B, illustrating A OR B, A AND B, and A AND NOT(B)]
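Boolean retrieval maps directly onto set operations over an inverted index. A minimal sketch (the toy documents and terms are invented for illustration):

```python
# Toy collection: document id -> set of terms it contains.
docs = {
    1: {"good", "party", "music"},
    2: {"republican", "party", "platform"},
    3: {"excellent", "party", "food"},
}

# Inverted index: term -> set of ids of documents containing it.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

def posting(term):
    return index.get(term, set())

r_and = posting("party") & posting("good")                           # AND: intersection (narrows)
r_or = posting("party") & (posting("good") | posting("excellent"))   # OR: union (broadens)
r_not = posting("party") - posting("republican")                     # NOT: difference (narrows)
```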
Boolean Model: Strength
Boolean operators define the relationship between query terms.
AND → terms/concepts that are not equivalent/similar
• party AND good: good party
• Retrieves records that include all AND terms → narrows the search
OR → related terms, synonyms
• party AND (good OR excellent OR wild): good party, excellent party, wild party
• Retrieves records that include any OR term → broadens the search
NOT → antonyms, alternate terms for polysemes
• party NOT republican: excludes the “Republican party” sense
• Eliminates records that include the NOT term → narrows the search
Precise, if the searcher knows the right strategies: which concepts to combine or exclude, when to narrow or broaden.
Efficient for the computer.
Boolean Model: Weakness
Natural language is far more complex: Boolean logic is insufficient to capture the richness of language.
AND “discovers” nonexistent relationships
• Terms may occur in different sentences, paragraphs, … → “Money is good, but I won’t be party to stealing.”
Guessing terminology for OR is hard
• good, nice, excellent, outstanding, awesome, …
Guessing terms to exclude (NOT) is even harder!
• Republican party, party to a lawsuit, …
No control over the size of the result set
• Too many documents or none; all documents in the result set are considered “equally good”
No partial matching
• Documents that “don’t quite match” the query may also be useful.
Vector Space Model: Overview
Documents are represented as vectors of terms. The query is also represented as a term vector. Documents are ranked by their similarity to the query.
Similarity = cosine of the angle between document & query vectors.
The vector space is n-dimensional, where n = number of terms in the collection. Terms form orthogonal vectors along the axes ↔ term independence assumption.
Vector Space Model: Representation
“Bags of words” can be represented as vectors:
• Computational efficiency
• Ease of manipulation
• Geometric metaphor: “arrows”
A vector is a set of values recorded in a consistent order.
“The quick brown fox jumped over the lazy dog’s back”

Bag of words:
back 1, brown 1, dog 1, fox 1, jump 1, lazy 1, over 1, quick 1, the 2

Vector: (1, 1, 1, 1, 1, 1, 1, 1, 2)
where the 1st position corresponds to “back”, the 2nd to “brown”, the 3rd to “dog”, the 4th to “fox”, the 5th to “jump”, the 6th to “lazy”, the 7th to “over”, the 8th to “quick”, and the 9th to “the”.
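The bag-to-vector conversion above can be sketched directly: count the words, fix a term order (alphabetical here), and read off the counts.

```python
from collections import Counter

sentence = "The quick brown fox jumped over the lazy dog's back"

# Crude normalization to match the slide's terms: lowercase,
# strip the possessive ("dog's" -> "dog"), stem "jumped" -> "jump".
words = [w.lower().removesuffix("'s") for w in sentence.split()]
words = ["jump" if w == "jumped" else w for w in words]

bag = Counter(words)
vocab = sorted(bag)               # fixed (alphabetical) term order
vector = [bag[t] for t in vocab]  # -> (1, 1, 1, 1, 1, 1, 1, 1, 2)
```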
Vector Space Model: Retrieval
Order documents by “relevance”
• Relevance = how likely they are to be relevant to the information need
• Some documents are “better” than others; users can decide when to stop reading
Best (partial) match
• Documents need not have all query terms
• Documents with more query terms should be “better”
Estimate relevance with query-document similarity:
1. Treat the query as if it were a document: create a query bag-of-words and compute term weights
2. Find its similarity to each document
3. Rank-order the documents by similarity
• Works surprisingly well
Vector Space Model: 3-D Example
[Figure: a vector A in 3-dimensional space, represented with its initial point at the origin of a rectangular coordinate system. The projections of A on the x, y, and z axes (Ax, Ay, Az) are the (rectangular) components of A in the x, y, and z directions; each axis represents a term (e.g., x = all, y = brown, z = cat).]
Vector Space Model: Postulate
Documents that are “close together” in vector space “talk about” the same things.
[Figure: document vectors d1–d5 plotted along term axes t1, t2, t3, with angle θ between two of them.]
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”).
Vector Space Model: Example
Query: What is information retrieval? → q: information 1, retrieval 1

D1: Information retrieval seminars
D2: Retrieval seminars and information retrieval
D3: Information seminar

Index Term        d1   d2   d3
t1 (information)   1    1    1
t2 (retrieval)     1    2    0
t3 (seminar)       1    1    1

[Figure: query vector q and document vectors d1, d2, d3 plotted along term axes t1, t2, t3]

With q = (1, 1, 0):
cos(q, d1) = (1·1 + 1·1 + 0·1) / (√2 · √3) = 2/√6 ≈ 0.82 → θ ≈ 35°
cos(q, d2) = (1·1 + 1·2 + 0·1) / (√2 · √6) = 3/√12 ≈ 0.87 → θ ≈ 30°
cos(q, d3) = (1·1 + 1·0 + 0·1) / (√2 · √2) = 1/2 → θ = 60°

Ranking by similarity: d2, d1, d3.
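The cosine computations in this example can be reproduced in a few lines:

```python
import math

def cos_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# Axes: t1 = information, t2 = retrieval, t3 = seminar
q = (1, 1, 0)
d1 = (1, 1, 1)
d2 = (1, 2, 1)
d3 = (1, 0, 1)

angles = {name: math.degrees(math.acos(cos_sim(q, vec)))
          for name, vec in [("d1", d1), ("d2", d2), ("d3", d3)]}
ranking = sorted(angles, key=angles.get)  # smallest angle first
# d2 at 30°, d1 at ≈35°, d3 at 60° -> ranking: d2, d1, d3
```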
Vector Space Model: Pros & Cons
Pros
• Non-binary term weights
• Partial matching
• Ranked results
• Easy query formulation
• Query expansion
Cons
• Term relationships ignored
• Term order ignored
• No wildcard
• Problematic with long documents
• Similarity ≠ Relevance
Boolean vs. Vector Space Model
Boolean model
• Based on the notion of sets; does not impose a ranking on retrieved documents
• Documents are retrieved only if they satisfy the Boolean conditions specified in the query → exact match
Vector space model
• Based on geometry: the notion of vectors in a high-dimensional space
• Documents are ranked based on their similarity to the query → best/partial match
Probabilistic Model: Retrieval
Probability Ranking Principle (Robertson, 1977)
• Assumption: each document is either relevant or not relevant
• Ranking by the document’s probability of relevance to the query → maximizes overall retrieval effectiveness

Binary Independence Retrieval Model
Rank documents (binary term vectors d) by their relevance odds O(R|d) = P(R|d) / P(R̄|d), which is rank-equivalent to

RSV(d) = Σ_{i=1}^{n} t_i × log [ p_i(1−q_i) / (q_i(1−p_i)) ]

• P(R|d) = probability that document d is relevant
• t_i = ith of n terms (1 if it occurs in d, 0 otherwise)
• p_i = probability that t_i occurs in a relevant document
• q_i = probability that t_i occurs in a non-relevant document

(Derivation in the Appendix.)
Probabilistic Model: Term Weight
Term Relevance Weight: a term weight based on the Binary Independence Retrieval Model → leverages relevance information to weight terms.

tr_k = log [ p_k(1−q_k) / (q_k(1−p_k)) ] = log [ O(p_k) / O(q_k) ] ≅ log [ (N_d − d_k) / d_k ]

• tr_k = Term Relevance Weight of term k
• O(p_k) = odds that term k appears in a relevant document
• O(q_k) = odds that term k appears in a non-relevant document

Probability estimation without relevance information:
• q_k ≈ d_k / N_d (assume #non-relevant docs ≈ collection size)
• p_k ≈ 0.5 (assume random chance)

where
• p_k = probability that term k occurs in a relevant document
• q_k = probability that term k occurs in a non-relevant document
• d_k = number of documents in which term k occurs
• N_d = number of documents in the collection (i.e., collection size)

(Derivation in the Appendix.)
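A small sketch of the term relevance weight under these no-relevance-information estimates, including the smoothed variant q_k = (d_k + 0.5) / (N_d + 1) derived in the Appendix:

```python
import math

def tr(d_k, N_d, smoothed=False):
    """Term relevance weight with p_k = 0.5 (no relevance information):
    the p-terms cancel, leaving tr_k = log((1 - q_k) / q_k)."""
    if smoothed:
        # q_k = (d_k + 0.5) / (N_d + 1)  ->  log((N_d - d_k + 0.5) / (d_k + 0.5))
        return math.log((N_d - d_k + 0.5) / (d_k + 0.5))
    # conventional q_k = d_k / N_d  ->  log((N_d - d_k) / d_k)
    return math.log((N_d - d_k) / d_k)

def idf(d_k, N_d):
    return math.log(N_d / d_k)
```

For a rare term in a large collection, tr(d_k, N_d) is close to idf(d_k, N_d), matching the remark that the weight behaves like idf for large N_d.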
Appendix
Binary Independence Retrieval Model

A document is a binary term vector d = (t_1, …, t_n), where t_i = 1 if term i belongs to d and t_i = 0 otherwise.

Prior odds of relevance:
O(R) = P(R) / P(R̄)

Odds of relevance given d:
O(R|d) = P(R|d) / P(R̄|d)
       = [P(R) P(d|R) / P(d)] / [P(R̄) P(d|R̄) / P(d)]        (apply Bayes’ theorem)
       = O(R) × ∏_{i=1}^{n} [ P(t_i|R) / P(t_i|R̄) ]           (assume term independence)

where
• P(R) = probability that a (randomly selected) document is relevant
• P(d) = probability that document d is observed
• P(R|d) = probability that document d is relevant
• P(d|R) = probability of observing document d from the set of relevant documents R
  = probability that a (randomly selected) document from R is document d

Split the product by presence/absence of terms:
O(R|d) = O(R) × ∏_{t_i=1} [ P(t_i=1|R) / P(t_i=1|R̄) ] × ∏_{t_i=0} [ P(t_i=0|R) / P(t_i=0|R̄) ]

Let p_i = P(t_i=1|R), the probability that t_i occurs in a relevant document, and q_i = P(t_i=1|R̄), the probability that t_i occurs in a non-relevant document. Then

O(R|d) = O(R) × ∏_{t_i=1} (p_i / q_i) × ∏_{t_i=0} [ (1−p_i) / (1−q_i) ]
       = O(R) × ∏_{t_i=1} [ p_i(1−q_i) / (q_i(1−p_i)) ] × ∏_{i=1}^{n} [ (1−p_i) / (1−q_i) ]

(the second step multiplies and divides the t_i=1 product by (1−p_i)/(1−q_i), so the trailing product runs over all n terms).

Ignore O(R) and ∏_{i=1}^{n} (1−p_i)/(1−q_i), which are constant across documents, and take the log, using log(x_1 x_2 … x_n) = log x_1 + log x_2 + … + log x_n = Σ_{i=1}^{n} log x_i:

RSV(d) = Σ_{t_i=1} log [ p_i(1−q_i) / (q_i(1−p_i)) ] = Σ_{i=1}^{n} t_i × log [ p_i(1−q_i) / (q_i(1−p_i)) ]
Term Relevance Weight

From RSV(d) = Σ_{i=1}^{n} t_i × log [ p_i(1−q_i) / (q_i(1−p_i)) ], the weight of term k is the log odds ratio:

tr_k = log [ p_k(1−q_k) / (q_k(1−p_k)) ] = log [ p_k/(1−p_k) ] − log [ q_k/(1−q_k) ] = log [ O(p_k) / O(q_k) ]
     = log (odds that term k appears in a relevant document) − log (odds that term k appears in a non-relevant document)

Probability estimation for the Term Relevance Weight without relevance information:
• Known
  → N_d = total number of documents in the collection (collection size)
  → d_k = number of documents in which term k appears (postings)
• Estimated
  → q_k = probability that term k occurs in a non-relevant document:
     q_k ≈ d_k / N_d (conventional), or q_k = (d_k + 0.5) / (N_d + 1)
     (assume that the number of non-relevant documents ≈ collection size)
  → p_k = probability that term k occurs in a relevant document: p_k ≈ 0.5 (assume random chance)

With p_k = 0.5, the p-terms cancel, since log [ p_k/(1−p_k) ] = log 1 = 0, leaving

tr_k = log [ (1−q_k) / q_k ]

Conventional estimate (q_k = d_k / N_d):
tr_k = log [ (1 − d_k/N_d) / (d_k/N_d) ] = log [ (N_d − d_k) / d_k ]

Smoothed estimate (q_k = (d_k + 0.5) / (N_d + 1)):
tr_k = log [ (1 − (d_k+0.5)/(N_d+1)) / ((d_k+0.5)/(N_d+1)) ] = log [ (N_d − d_k + 0.5) / (d_k + 0.5) ]

• Same as idf for large N_d and small d_k: tr_k ≈ log (N_d / d_k)