Modeling (Chap. 2), Modern Information Retrieval, Spring 2000
TRANSCRIPT
Modeling (Chap. 2)
Modern Information Retrieval
Spring 2000

Introduction
Traditional IR systems adopt index terms to index and retrieve documents. An index term is simply any word that appears in the text of a document. Retrieval based on index terms is simple: the premise is that the semantics of a document and of the user's information need can be expressed through a set of index terms.

Key Question
The semantics in a document (or in a user request) is lost when the text is replaced with a set of words. Matching between documents and user requests is done in the very imprecise space of index terms, yielding low-quality retrieval. The problem is worsened for users with no training in properly forming queries, a cause of the frequent dissatisfaction of Web users with the answers they obtain.

Taxonomy of IR Models
Three classic models:
Boolean: documents and queries are represented as sets of index terms
Vector: documents and queries are represented as vectors in a t-dimensional space
Probabilistic: document and query representations are based on probability theory

Basic Concepts
The classic models consider that each document is described by a set of index terms. An index term is a (document) word that helps in remembering the document's main themes; index terms are used to index and summarize document content. In general, index terms are nouns, because nouns carry meaning by themselves. Index terms may also be taken to be all the distinct words in a document collection.

Distinct index terms have varying relevance when describing document contents. Thus, numerical weights are assigned to the index terms of a document. Let ki be an index term, dj a document, and wi,j ≥ 0 the weight for the pair (ki, dj). The weight quantifies the importance of the index term for describing the document's semantic contents.

Definition (p. 25)
Let t be the number of index terms in the system and ki a generic index term. K = {k1, …, kt} is the set of all index terms. A weight wi,j > 0 is associated with each index term ki appearing in document dj; for an index term that does not appear in the document text, wi,j = 0. Document dj is associated with the index term vector d⃗j = (w1,j, w2,j, …, wt,j).

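As a toy illustration of this representation, the vocabulary K and a per-document weight vector can be built as follows (a hypothetical sketch; the function names, the whitespace tokenization, and the binary weighting are illustrative assumptions, not from the slides):

```python
# Sketch: build binary index-term vectors over the vocabulary K.
# Names (build_vocabulary, term_vector) are illustrative, not from the text.

def build_vocabulary(documents):
    """K = {k1, ..., kt}: all distinct words in the collection."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def term_vector(document, vocab):
    """d_j = (w1,j, ..., wt,j); here wi,j is 1 if ki appears in dj, else 0."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocab]

docs = ["information retrieval models", "boolean retrieval"]
K = build_vocabulary(docs)      # ['boolean', 'information', 'models', 'retrieval']
print(term_vector(docs[0], K))  # [0, 1, 1, 1]
```

With non-binary weights (e.g. tf-idf, introduced later in the chapter), the same vector shape holds; only the entries change.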
Boolean Model
A simple retrieval model based on set theory and Boolean algebra. The framework is easy for users to grasp, since the concept of a set is intuitive. Queries are specified as Boolean expressions, which have precise semantics.

Drawbacks
The retrieval strategy is a binary decision (a document is either relevant or non-relevant), which prevents good retrieval performance. It is not simple to translate an information need into a Boolean expression; many users find it difficult and awkward. Despite these drawbacks, it is the dominant model with commercial DB systems.

Boolean Model (Cont.)
Considers that index terms are either present or absent in a document, so index term weights are binary, i.e. wi,j ∈ {0,1}. A query q is composed of index terms linked by not, and, or. A query is thus a Boolean expression, which can be represented in disjunctive normal form (DNF).

Boolean Model (Cont.)
The query [q = ka ∧ (kb ∨ ¬kc)] can be written in DNF as [q⃗dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)]. Each component is a binary weighted vector associated with the tuple (ka, kb, kc). These binary weighted vectors are called the conjunctive components of q⃗dnf.

Boolean Model (Cont.)
Index term weight variables are all binary, i.e. wi,j ∈ {0,1}, and a query q is a Boolean expression. Let q⃗dnf be the DNF of query q, and let q⃗cc be any of the conjunctive components of q⃗dnf. The similarity of document dj to query q is

sim(dj,q) = 1 if ∃ q⃗cc | (q⃗cc ∈ q⃗dnf) ∧ (∀ki, gi(d⃗j) = gi(q⃗cc))
sim(dj,q) = 0 otherwise

where gi(d⃗j) = wi,j.

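This matching rule can be sketched in code (an illustrative sketch, restricted to the three-term universe (ka, kb, kc) of the earlier query example, with its DNF components hard-coded):

```python
# Sketch: Boolean-model similarity via DNF conjunctive components.
# The query q = ka AND (kb OR NOT kc) has these components over (ka, kb, kc),
# per the slide example.
DNF_COMPONENTS = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def boolean_sim(doc_vector, dnf_components):
    """sim(dj,q) = 1 iff the document's binary vector equals some
    conjunctive component of the query's DNF, else 0."""
    return 1 if tuple(doc_vector) in dnf_components else 0

print(boolean_sim([1, 1, 0], DNF_COMPONENTS))  # 1: doc has ka, kb, not kc
print(boolean_sim([0, 1, 1], DNF_COMPONENTS))  # 0: ka absent
```

Note the binary outcome: a document scores 1 or 0, with no in-between, which is exactly the lack of partial matching discussed next.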
Boolean Model (Cont.)
If sim(dj,q) = 1, the Boolean model predicts that document dj is relevant to query q (it might not be). Otherwise, the prediction is that the document is not relevant. The Boolean model thus predicts that each document is either relevant or non-relevant; there is no notion of a partial match.

Main advantages: clean formalism and simplicity.
Main disadvantages: exact matching may lead to retrieval of too few or too many documents, and there is no index term weighting, even though weighting can lead to improvement in retrieval performance.

Vector Model
Assigns non-binary weights to index terms in queries and in documents. These term weights are used to compute the degree of similarity between each document and the user query. By sorting the retrieved documents in decreasing order of this degree of similarity, the vector model takes into consideration documents which match the query only partially. The ranked document answer set is a lot more precise than the answer set produced by the Boolean model.

Vector Model (Cont.)
The weight wi,j for the pair (ki, dj) is positive and non-binary. The index terms in the query are also weighted: let wi,q ≥ 0 be the weight associated with the pair [ki, q]. The query vector is defined as q⃗ = (w1,q, w2,q, …, wt,q), where t is the total number of index terms in the system. The vector for document dj is represented by d⃗j = (w1,j, w2,j, …, wt,j).

Vector Model (Cont.)
Document dj and user query q are represented as t-dimensional vectors. The degree of similarity of dj with regard to q is evaluated as the correlation between the vectors d⃗j and q⃗. This correlation can be quantified by the cosine of the angle between the two vectors:

sim(dj,q) = (d⃗j · q⃗) / (|d⃗j| × |q⃗|)

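A minimal sketch of this cosine computation (the function name is an assumption, not from the slides):

```python
# Sketch: cosine similarity between a document vector and a query vector.
import math

def cosine_sim(d, q):
    """sim(dj,q) = (dj . q) / (|dj| * |q|); assumes non-negative weights."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # empty document or query: no evidence of similarity
    return dot / (norm_d * norm_q)

print(cosine_sim([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # ≈ 1.0 (same direction)
print(cosine_sim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (no shared terms)
```

Because the weights are non-negative, the cosine stays in [0, 1], matching the range stated on the next slide.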
Vector Model (Cont.)
sim(q,dj) varies from 0 to +1, and the model ranks the documents according to their degree of similarity to the query. A document may be retrieved even if it matches the query only partially: establish a threshold on sim(dj,q) and retrieve the documents with a degree of similarity above that threshold.

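The threshold-and-rank step can be sketched as follows (illustrative names; the similarity scores are assumed to be precomputed, e.g. cosine values):

```python
# Sketch: rank documents by similarity and apply a retrieval threshold.
def retrieve(scores, threshold):
    """scores: {doc_id: sim}; return ids above the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [doc for doc, sim in ranked if sim > threshold]

scores = {"d1": 0.82, "d2": 0.15, "d3": 0.47}
print(retrieve(scores, 0.3))  # ['d1', 'd3']
```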
Index Term Weights
The documents are a collection C of objects, and the user query specifies a set A of objects. The IR problem is to determine which documents are in set A and which are not, i.e. a clustering problem. A clustering problem involves:
intra-cluster similarity: the features which better describe the objects in set A
inter-cluster dissimilarity: the features which better distinguish the objects in set A from the remaining objects in collection C

In the vector model, intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj (the tf factor): how well the term describes the document contents. Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of the term ki among the documents in the collection (the idf factor): terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one.

Definition (p. 29)
Let N be the total number of documents in the system and ni the number of documents in which index term ki appears. Let freqi,j be the raw frequency of term ki in document dj, i.e. the number of times the term ki is mentioned in the text of document dj. The normalized frequency fi,j of term ki in dj is given by

fi,j = freqi,j / maxl freql,j

The maximum is computed over all terms mentioned in the text of document dj. If the term ki does not appear in document dj, then fi,j = 0. Let idfi, the inverse document frequency for ki, be

idfi = log(N / ni)

The best-known term weighting scheme uses weights given by

wi,j = fi,j × log(N / ni)

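This tf-idf scheme can be sketched end-to-end (a hypothetical helper; the function name and the whitespace tokenization are assumptions, not from the slides):

```python
# Sketch: tf-idf weights w_ij = f_ij * log(N / n_i), where f_ij is the raw
# frequency of term k_i in d_j normalized by the most frequent term in d_j.
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {term: weight} dict per document."""
    N = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # n_i: number of documents in which term k_i appears
    n = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        freq = Counter(tokens)
        max_freq = max(freq.values())
        weights.append({term: (count / max_freq) * math.log(N / n[term])
                        for term, count in freq.items()})
    return weights

w = tf_idf_weights(["retrieval models", "retrieval systems systems"])
# 'retrieval' appears in every document, so log(N / n_i) = log(1) = 0
print(w[0]["retrieval"])  # 0.0
```

The zero weight for a term occurring in every document shows the idf factor at work: such a term cannot distinguish any document from the rest.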
Advantages of the Vector Model
Its term weighting scheme improves retrieval performance; its partial matching strategy retrieves documents that approximate the query conditions; and it sorts the documents according to their degree of similarity to the query.
Disadvantage
It assumes that index terms are mutually independent.

Probabilistic Model
Given a user query, there is a set of documents containing exactly the relevant documents: the ideal answer set. Given a description of this ideal answer set, there would be no problem in retrieving its documents. The querying process is thus a process of specifying the properties of the ideal answer set. The problem is that these properties are not exactly known; all we have are index terms whose semantics are used to characterize them.

Probabilistic Model (Cont.)
Since these properties are not known at query time, an effort has to be made to initially guess what they are. This initial guess generates a preliminary probabilistic description of the ideal answer set, which is used to retrieve a first set of documents. User interaction is then initiated to improve the probabilistic description of the ideal answer set.

The user examines the retrieved documents and decides which ones are relevant. This information is used to refine the description of the ideal answer set. By repeating this process, the description evolves and comes closer to the real ideal answer set.

Fundamental Assumption
Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find document dj relevant. It assumes that this probability of relevance depends only on the query and document representations, and that there is a subset of all documents which the user prefers as the answer set for query q. This ideal answer set is labeled R; the documents in set R are predicted to be relevant to the query.

Given a query q, the probabilistic model assigns to each document dj, as its measure of similarity to the query, the ratio P(dj relevant-to q) / P(dj non-relevant-to q): the odds of document dj being relevant to query q.

Index term weight variables are all binary, i.e. wi,j ∈ {0,1} and wi,q ∈ {0,1}, and a query q is a subset of index terms. Let R be the set of documents known (or initially guessed) to be relevant, and let R̄ be the complement of R. Let P(R|d⃗j) be the probability that document dj is relevant to query q, and P(R̄|d⃗j) the probability that document dj is not relevant to query q.

The similarity sim(dj,q) of document dj to query q is the ratio

sim(dj,q) = P(R|d⃗j) / P(R̄|d⃗j)

Applying Bayes' rule and dropping the factors that are constant for all documents,

sim(dj,q) ~ P(d⃗j|R) / P(d⃗j|R̄)

Assuming independence of index terms, this can be rewritten as

sim(dj,q) ~ Σi=1..t wi,q × wi,j × ( log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) )

How can P(ki|R) and P(ki|R̄) be computed initially?
Assume P(ki|R) is constant for all index terms ki (typically 0.5): P(ki|R) = 0.5.
Assume that the distribution of index terms among the non-relevant documents is approximated by the distribution of index terms among all the documents in the collection: P(ki|R̄) = ni / N, where ni is the number of documents containing index term ki and N is the total number of documents.

Let V be the subset of documents initially retrieved and ranked by the model, and let Vi be the subset of V composed of the documents in V which contain the index term ki. P(ki|R) is then approximated by the distribution of the index term ki among the documents retrieved so far: P(ki|R) = |Vi| / |V|. P(ki|R̄) is approximated by considering that all the non-retrieved documents are non-relevant: P(ki|R̄) = (ni − |Vi|) / (N − |V|).

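These estimates can be put together in a small sketch (the numbers and names are illustrative assumptions; the weight formula is the per-term summand of the earlier similarity expression):

```python
# Sketch: probabilistic-model term weights with the slides' initial estimates
# P(ki|R) = 0.5 and P(ki|R_bar) = n_i/N, then one re-estimation step using the
# retrieved subset V.
import math

def term_weight(p_rel, p_nonrel):
    """log(P(ki|R)/(1-P(ki|R))) + log((1-P(ki|R_bar))/P(ki|R_bar))."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

N = 100    # total number of documents (illustrative)
n_i = 10   # documents containing term k_i (illustrative)
# Initial guess:
w_initial = term_weight(0.5, n_i / N)
# After retrieving |V| = 20 documents, |Vi| = 8 of which contain k_i:
V, Vi = 20, 8
w_refined = term_weight(Vi / V, (n_i - Vi) / (N - V))
print(round(w_initial, 2))  # 2.2
print(round(w_refined, 2))  # 3.26
```

The weight rises here because the retrieved evidence makes ki look much rarer among the presumed non-relevant documents; repeating the retrieval/re-estimation cycle is exactly the refinement loop described on the earlier slides.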
Advantages
Documents are ranked in decreasing order of their probability of being relevant.
Disadvantages
The need to guess the initial separation of the relevant and non-relevant sets; all index term weights are binary; and index terms are assumed to be mutually independent.