IR Models
by Prof. Tarjni Vyas
Introduction

[Diagram: the user's information need is abstracted into a query over index terms; the documents in the docs DB are abstracted into document representations; matching the query against these representations produces a ranked list of docs.]
IR Models

User task:
• Retrieval: ad hoc, filtering
• Browsing

Taxonomy of IR models:
• Classic models: Boolean, vector, probabilistic
• Set theoretic: fuzzy, extended Boolean
• Algebraic: generalized vector, latent semantic indexing, neural networks
• Probabilistic: inference network, belief network
• Structured models: non-overlapping lists, proximal nodes
• Browsing: flat, structure guided, hypertext
Specifying an IR Model
• An IR model is a quadruple [D, Q, F, R(qi, dj)]
• D = representation of documents
• Q = representation of queries
• F = framework for modeling document and query representations and their relationships (a standard language/algebra/implementation type for translation that provides the semantics)
• Evaluation w.r.t. "direct" semantics through benchmarks
• R(qi, dj) = ranking function that associates a real number with a query-document pair
About index terms
• Each document is represented by a set of representative keywords or index terms.
• Index terms are meant to capture the document's main themes or semantics.
• Usually, index terms are nouns, because nouns have meaning by themselves.
• However, search engines assume that all words are index terms (full-text representation).
• E.g., T1 = "conference", T2 = "crime"
• Adjectives, adverbs, conjunctions, etc. are not useful as index terms.
Notations/Conventions
• ki is an index term
• dj is a document
• t is the total number of index terms
• K = (k1, k2, ..., kt) is the set of all index terms
• wij >= 0 is the weight associated with the pair (ki, dj); wij = 0 if the term does not appear in the doc
• vec(dj) = (w1j, w2j, ..., wtj) is the weight vector associated with the document dj
• gi(vec(dj)) = wij is the function which returns the weight associated with the pair (ki, dj)
The Boolean Model
• Simple model based on set theory
• Queries and documents are specified as Boolean expressions, which gives precise semantics
• E.g., q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent, thus wij ∈ {0,1}
Example
• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) (disjunctive normal form)
• vec(qcc) = (1,1,0) (a conjunctive component)
• Similar/matching documents:
• md1 = [ka ka d e] => (1,0,0)
• md2 = [ka kb kc] => (1,1,1)
• Unmatched documents:
• ud1 = [ka kc] => (1,0,1)
• ud2 = [d] => (0,0,0)
Similarity/Matching function
sim(q, dj) = 1 if ∃ vec(qcc) such that vec(qcc) ∈ vec(qdnf) and vec(dj) matches vec(qcc) on all query terms
sim(q, dj) = 0 otherwise
• Requires coercion for accuracy
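As an illustration (my sketch, not from the slides), Boolean matching can be implemented by checking a document's binary vector against the conjunctive components of the query's DNF:

```python
# Minimal sketch of Boolean-model matching (illustrative, not from the slides).
# A document is a binary vector over the index terms (ka, kb, kc); the query
# q = ka AND (kb OR NOT kc) is represented by its disjunctive normal form.

Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}  # conjunctive components of q

def sim(doc_vec):
    """Return 1 if the document's binary vector equals any conjunctive
    component of the query's DNF, else 0."""
    return 1 if tuple(doc_vec) in Q_DNF else 0

docs = {
    "md1": (1, 0, 0),  # contains ka only            -> matches
    "md2": (1, 1, 1),  # contains ka, kb, kc         -> matches
    "ud1": (1, 0, 1),  # contains ka and kc, no kb   -> no match
    "ud2": (0, 0, 0),  # contains none of the terms  -> no match
}
for name, vec in docs.items():
    print(name, sim(vec))
```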
Venn Diagram

[Venn diagram of the term sets Ka, Kb, Kc; the regions (1,1,1), (1,1,0), and (1,0,0) are those satisfying q = ka ∧ (kb ∨ ¬kc).]
Drawbacks of the Boolean model

• The expressive power of Boolean expressions to capture the information need and the document semantics is inadequate
• Retrieval based on a binary decision criterion (with no partial matching) does not adequately reflect our intuitions about relevance
• As a result:
• The answer set contains either too few or too many documents in response to a user query
• There is no ranking of documents
Vector Model

• Task:
• Document collection
• Query specifies information need: free text
• Relevance judgments: depend upon the weighting scheme for all docs
• Word evidence: bag of words (no ordering information)
Vector Space Model
• Represent documents and queries as vectors of term-based features
• Features: tied to the occurrence of terms in the collection
• E.g., Solution 1: binary features, ti = 1 if the term is present, 0 otherwise
• Similarity: the number of terms in common (the dot product)
dj = (t1,j, t2,j, ..., tN,j);  qk = (t1,k, t2,k, ..., tN,k)

sim(qk, dj) = Σi=1..N (ti,j × ti,k)
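A minimal sketch of this binary dot product (illustrative code, not from the slides); it reproduces the q·dj column of Example I on the next slide:

```python
# Minimal sketch of binary-feature vector matching (illustrative).
def sim(query, doc):
    """Dot product of binary term vectors: counts the terms in common."""
    return sum(q * d for q, d in zip(query, doc))

# Terms: (k1, k2, k3); query q = (1, 1, 1) as in Example I below.
q = (1, 1, 1)
docs = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (1, 1, 0), (0, 1, 0)]
print([sim(q, d) for d in docs])  # -> [2, 1, 2, 1, 3, 2, 1], matching the table
```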
The Vector Model: Example I

     k1  k2  k3  q·dj
d1    1   0   1     2
d2    1   0   0     1
d3    0   1   1     2
d4    1   0   0     1
d5    1   1   1     3
d6    1   1   0     2
d7    0   1   0     1
q     1   1   1

[Figure: the documents d1..d7 and the query plotted as vectors in the k1-k2-k3 space.]
Vector Space Model II
• Problem: not all terms are equally interesting, e.g. "accuracy" vs. "crime"
• Solution: replace binary term features with weights
• Document collection: term-by-document matrix
• View each document as a vector in a multidimensional space; nearby vectors are related
• Normalize for vector length

dj = (w1,j, w2,j, ..., wN,j);  qk = (w1,k, w2,k, ..., wN,k)
Cosine similarity
[Figure: documents d1 and d2 as vectors in the term space (t1, t2, t3), with angle θ between them.]

• The distance between vectors d1 and d2 is captured by the cosine of the angle θ between them.
Queries in the vector space model
Central idea: the query as a vector:
• We regard the query as a short document
• Note that dq is very sparse!
• We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.
sim(dj, dq) = (dj · dq) / (|dj| × |dq|) = Σi=1..n (wi,j × wi,q) / ( sqrt(Σi=1..n wi,j^2) × sqrt(Σi=1..n wi,q^2) )
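A minimal sketch of this cosine formula (illustrative; the example vectors are the d1 and q weights from Example III later in the deck):

```python
# Minimal sketch of cosine similarity between weighted term vectors (illustrative).
import math

def cosine_sim(d, q):
    """Cosine of the angle between d and q: dot product over vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([2, 0, 1], [1, 2, 3]))  # d1 and q weights from Example III
```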
Vector Similarity Computation
• Similarity = Dot product
• Normalization: normalize weights in advance, or normalize post-hoc
sim(qk, dj) = qk · dj = Σi=1..N (wi,k × wi,j)

sim(qk, dj) = Σi=1..N (wi,k × wi,j) / ( sqrt(Σi=1..N wi,k^2) × sqrt(Σi=1..N wi,j^2) )

• The normalized form is the cosine of the angle between the two vectors; the denominator involves the lengths of the vectors.
Computation of weights wij and wiq
• How do we compute the weights wij and wiq?
• Quantification of intra-document content (similarity / semantic emphasis): the tf factor, the term frequency within a document
• Quantification of inter-document separation (dissimilarity / significant discriminant): the idf factor, the inverse document frequency
• wij = tf(i,j) * idf(i)
Weighting scheme
• Let:
• N be the total number of docs in the collection
• ni be the number of docs which contain ki
• freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms l which occur within the document dj
• The idf factor is computed as idf(i) = log(N/ni)
• The log makes the values of tf and idf comparable.
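A minimal sketch of this weighting scheme (my code, assuming base-10 logs; the collection is the gold/silver/truck example used later in the deck):

```python
# Minimal sketch of the tf-idf weighting from this slide (illustrative;
# variable names are my own, not from the slides; base-10 log assumed).
import math

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc,
    with w_ij = f(i,j) * log(N / n_i)."""
    N = len(docs)
    n = {}  # n_i: number of docs containing term i
    for doc in docs:
        for term in set(doc):
            n[term] = n.get(term, 0) + 1
    weights = []
    for doc in docs:
        freq = {}
        for term in doc:
            freq[term] = freq.get(term, 0) + 1
        max_freq = max(freq.values())  # max_l freq(l,j)
        weights.append({t: (c / max_freq) * math.log10(N / n[t])
                        for t, c in freq.items()})
    return weights

docs = ["shipment of gold damaged in a fire".split(),
        "delivery of silver arrived in a silver truck".split(),
        "shipment of gold in a truck".split()]
for w in tfidf_weights(docs):
    print(w)
```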
Rules:
• WARNING: in a lot of IR literature, "frequency" is used to mean "count"
• Thus "term frequency" in the IR literature means the number of occurrences of a term in a doc
• It is not divided by document length (which would actually make it a frequency)
Best weighting scheme
• The best term-weighting schemes use weights given by wij = f(i,j) * log(N/ni)
• This strategy is called a tf-idf weighting scheme
• For the query term weights, use wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy for general collections, and it is also simple and fast to compute.
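A companion sketch for the query-term weights under the same assumptions (the document-frequency dictionary is taken from the same three-document collection):

```python
# Minimal sketch of the augmented query-term weight w_iq (illustrative).
import math

def query_weights(query_terms, N, n):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    freq = {}
    for t in query_terms:
        freq[t] = freq.get(t, 0) + 1
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * c / max_freq) * math.log10(N / n[t])
            for t, c in freq.items()}

# E.g. the query "gold silver truck" over the 3-doc collection above,
# where gold and truck occur in 2 docs and silver in 1:
print(query_weights("gold silver truck".split(), 3,
                    {"gold": 2, "silver": 1, "truck": 2}))
```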
The Vector Model: Example II

     k1  k2  k3  q·dj
d1    1   0   1     4
d2    1   0   0     1
d3    0   1   1     5
d4    1   0   0     1
d5    1   1   1     6
d6    1   1   0     3
d7    0   1   0     2
q     1   2   3

[Figure: the documents d1..d7 and the query plotted as vectors in the k1-k2-k3 space.]
The Vector Model: Example III

     k1  k2  k3  q·dj
d1    2   0   1     5
d2    1   0   0     1
d3    0   1   3    11
d4    2   0   0     2
d5    1   2   4    17
d6    1   2   0     5
d7    0   5   0    10
q     1   2   3

[Figure: the documents d1..d7 and the query plotted as vectors in the k1-k2-k3 space.]
We now consider the query "best auto car insurance" on a fictitious collection with N = 1,000,000 documents, where the document frequencies of auto, best, car, and insurance are respectively 5,000, 50,000, 10,000, and 1,000.

Summing the per-term contributions (one per query term: the product of the query weight and the document weight) gives a net score of 0 + 0 + 0.82 + 2.46 = 3.28.
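As a quick check (not on the original slide), these document frequencies imply the following base-10 idf values:

```python
# Base-10 idf values implied by the stated document frequencies (illustrative check).
import math

N = 1_000_000
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
idf = {t: math.log10(N / n) for t, n in df.items()}
print(idf)  # auto: ~2.3, best: ~1.3, car: 2.0, insurance: 3.0
```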
Example 1 (inverted index)

• Draw the inverted index that would be built for the following document collection.
• Doc 1: new home sales top forecasts
• Doc 2: home sales rise in july
• Doc 3: increase in home sales in july
• Doc 4: july new home sales rise
• Hint: i) arranging, ii) sorting, iii) merging
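A minimal sketch (illustrative) that builds the index by following the hint:

```python
# Minimal sketch of building an inverted index for the four docs above
# (illustrative; follows the hint: arrange (term, docID) pairs, sort, merge).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}
# i) arrange: one (term, docID) pair per token
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]
# ii) sort by term, then docID
pairs.sort()
# iii) merge duplicates into postings lists
index = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:
        postings.append(doc_id)
for term in sorted(index):
    print(term, "->", index[term])
```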
Example 2 (Boolean model)

• Consider these documents:
• Doc 1: breakthrough drug for schizophrenia
• Doc 2: new schizophrenia drug
• Doc 3: new approach for treatment of schizophrenia
• Doc 4: new hopes for schizophrenia patients
• For this document collection, use the Boolean model and show the returned results for the query:
a. schizophrenia AND drug
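A minimal sketch (illustrative) of the AND query as a merge of sorted postings lists; the postings follow from the four documents above:

```python
# Minimal sketch of Boolean AND via postings-list intersection (illustrative).
def intersect(p1, p2):
    """Merge two sorted postings lists, keeping docIDs present in both."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

postings = {"schizophrenia": [1, 2, 3, 4], "drug": [1, 2]}
print(intersect(postings["schizophrenia"], postings["drug"]))  # -> [1, 2]
```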
Example 3 (weighted zone scoring)

• Score according to the zone of the document.
• Consider the query shakespeare in a collection in which each document has three zones: author, title, and body.
• The Boolean score function for a zone takes on the value 1 if the query term shakespeare is present in the zone, and zero otherwise.
• Weighted zone scoring in such a collection requires three weights g1, g2, and g3, respectively corresponding to the author, title, and body zones.
• Suppose we set g1 = 0.2, g2 = 0.3, and g3 = 0.5 (so that the three weights add up to 1); this corresponds to an application in which a match in the author zone is least important to the overall score, the title zone somewhat more important, and the body still more.
• Thus, if the term shakespeare were to appear in the title and body zones but not the author zone of a document, the score of this document would be 0.3 + 0.5 = 0.8.
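A minimal sketch of this scoring rule (illustrative; the sample document is invented):

```python
# Minimal sketch of weighted zone scoring for a one-term query (illustrative).
ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}  # g1, g2, g3

def weighted_zone_score(doc, term):
    """Sum of the zone weights over the zones in which the term appears."""
    return sum(g for zone, g in ZONE_WEIGHTS.items()
               if term in doc.get(zone, "").lower().split())

doc = {"author": "anonymous",
       "title": "shakespeare and his plays",
       "body": "a study of shakespeare"}
print(weighted_zone_score(doc, "shakespeare"))  # -> 0.8 (title + body)
```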
Example (vector model)
• Q : “gold silver truck”
• D1 : “shipment of gold damaged in a fire”
• D2 : “delivery of silver arrived in a silver truck”
• D3 : “Shipment of gold in a truck”
• Find the ranking of the documents using the vector space model.
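One possible instantiation (my sketch, using raw tf × idf weights and cosine similarity; the normalized weightings from the earlier slides would work equally well):

```python
# Minimal sketch ranking D1-D3 for Q = "gold silver truck" (illustrative).
import math

docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold in a truck"]
query = "gold silver truck"

def tf(text):
    counts = {}
    for t in text.lower().split():
        counts[t] = counts.get(t, 0) + 1
    return counts

N = len(docs)
doc_tfs = [tf(d) for d in docs]
n = {}  # document frequency per term
for d in doc_tfs:
    for t in d:
        n[t] = n.get(t, 0) + 1
idf = {t: math.log10(N / cnt) for t, cnt in n.items()}

def weights(counts):
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

qw = weights(tf(query))
for i, d in enumerate(doc_tfs, 1):
    print(f"D{i}: {cosine(qw, weights(d)):.3f}")  # D2 ranks highest
```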
Algorithm for computing vector scores
• We now initiate the study of determining the K documents with the highest vector space scores for a query.
• Typically, we seek these top K documents ordered by decreasing score; for instance, many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
• Here we give the basic algorithm for this computation.
Algorithm for computing vector scores
• COSINESCORE(q)
1 float Scores[N] = 0
2 Initialize Length[N]
3 for each query term t
4 do calculate wt,q and fetch postings list for t
5 for each pair(d, tft,d) in postings list
6 do Scores[d] += wft,d × wt,q
7 Read the array Length[d]
8 for each d
9 do Scores[d] = Scores[d]/Length[d]
10 return Top K components of Scores[]
• The array Length holds the length (normalization factor) for each of the N documents
• The array Scores holds the score for each of the documents
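A minimal Python rendering of COSINESCORE (illustrative; the postings format and the exact weighting are assumptions, not fixed by the pseudocode):

```python
# Minimal sketch of the COSINESCORE pseudocode above (illustrative).
import heapq

def cosine_score(query_weights, postings, lengths, k=10):
    """query_weights: {term: w_tq}; postings: {term: [(doc_id, wf_td), ...]};
    lengths: {doc_id: vector length}. Returns the top-k (score, doc_id) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():          # for each query term t
        for doc_id, wf_td in postings.get(term, []):  # traverse its postings
            scores[doc_id] = scores.get(doc_id, 0.0) + wf_td * w_tq
    for doc_id in scores:                             # length-normalize
        scores[doc_id] /= lengths[doc_id]
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```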
Advantages and disadvantages
• Advantages:
• Term weighting improves answer-set quality
• Partial matching allows retrieval of docs that approximate the query conditions
• The cosine ranking formula sorts documents according to their degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms (though it is not clear that this is a bad assumption)
Why use probabilities ?
• Information retrieval deals with uncertain information
• Probability theory seems to be the most natural way to quantify uncertainty
Goal

• Collection of documents
• User issues a query
• A set of documents needs to be returned
• Question: in what order should we present documents to the user?
• Intuitively, we want the "best" document to be first, the second best second, etc.
• We need a formal way to judge the "goodness" of documents w.r.t. queries.
• Idea: the probability of relevance of the document w.r.t. the query
Probability Ranking Principle
If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request ...

... where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose ...

... then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

W.S. Cooper
Probability theory
• For two events A and B, the joint event of both events occurring is described by the joint probability P(A, B).
• The conditional probability P(A|B) expresses the probability of event A given that event B occurred.
• The fundamental relationship between joint and conditional probabilities is given by the chain rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A)
Let us remember Probability Theory
Let a, b be two events.
p(a|b) p(b) = p(a, b) = p(b|a) p(a)

p(a|b) = p(b|a) p(a) / p(b)    (Bayes' rule)
Probability Ranking Principle
Let x be a document in the collection.
Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance.

p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)

• p(x|R), p(x|NR): the probability that if a relevant (non-relevant) document is retrieved, it is x
• p(R), p(NR): the prior probability of retrieving a (non-)relevant document
• We need to find p(R|x): the probability that a retrieved document x is relevant.
Probability Ranking Principle
p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)

Ranking Principle (Bayes' Decision Rule):
If p(R|x) > p(NR|x), then x is relevant; otherwise x is not relevant.
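A toy illustration of the decision rule (all numbers invented); since p(x) is common to both posteriors, only the numerators need to be compared:

```python
# Minimal sketch of the Bayes decision rule for relevance (invented numbers).
def is_relevant(p_x_given_R, p_x_given_NR, p_R=0.05, p_NR=0.95):
    """Compare p(R|x) and p(NR|x); p(x) cancels, so compare the numerators."""
    return p_x_given_R * p_R > p_x_given_NR * p_NR

print(is_relevant(p_x_given_R=0.4, p_x_given_NR=0.01))  # True: 0.02 > 0.0095
```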
Independence assumptions about relevant and non-relevant documents

• I1: the distribution of terms in relevant documents is independent, and their distribution in all documents is independent.
• The presence of one term in a document does not assure the presence of another term; they are independent and random.
• I2: the distribution of terms in relevant documents is independent, and their distribution in non-relevant documents is independent.
• Satisfies I1.
• Query = "A B C": the presence of A does not assure the presence or absence of B in a document.
Methods and assumptions
• O1: probable relevance is based on the presence of search terms in the document.
• This says that a document should be ranked only when some query terms match in the document; evidence of relevance must be found.
• O2: probable relevance is based on both the presence of search terms in the document and their absence from the document.
• O2 does not mean that we know nothing; it means that we also have some evidence for non-relevance.
• Together, O1 and O2 mean that we should consider both the presence and the absence of all search terms in the query.
Combination of the methods using probability
• N = number of documents in the collection
• R = number of relevant documents for a given query q
• n = number of documents indexed by a given term t
• r = number of relevant documents indexed by the given term t
• Choosing I1 and O1 gives the weight W1 = log( (r/R) / (n/N) )
• r/R is the relevant-document ratio for the term; n/N depicts the overall frequency of the term in the collection, and if it is too high it decreases the overall weight W1.
Combining the methods

• I2 and O1: W2 = log( (r/R) / ((n-r)/(N-R)) )
• (n-r) = (docs indexed by the term) - (relevant docs indexed by the term) = non-relevant docs indexed by the term
• (N-R) = (total docs) - (relevant docs) = total non-relevant docs for the query
• I1 and O2: W3 = log( (r/(R-r)) / (n/(N-n)) )
• (N-n) = (total docs) - (docs indexed by the term) = docs not indexed by the term (better high)
• (R-r) = (relevant docs for the query) - (relevant docs indexed by the term) (better low)
• I2 and O2: W4 = log( (r/(R-r)) / ((n-r)/((N-n)-(R-r))) )
Example for weight calculation
• Q : “gold silver truck”
• D1 : “shipment of gold damaged in a fire”
• D2 : “delivery of silver arrived in a silver truck”
• D3 : “Shipment of gold in a truck”
• Use the probabilistic model to find the appropriate ranking of the documents.
• Since we are using this procedure in a predictive manner, Robertson and Spärck Jones recommended adding constants to each quantity:
• Add 0.5 to r, 1 to R, 1 to n, and 2 to N, and then calculate W1.
Example: apply the formulas to find W1, W2, W3, and W4 for each term and document.

       gold  silver  truck
N        3      3      3
n        2      1      2
R        2      2      2
r        1      1      2

• N = number of documents in the collection
• R = number of relevant documents for a given query q
• n = number of documents indexed by a given term t
• r = number of relevant documents indexed by the given term t
Modified weight formulas
• W1 = log( ((r+0.5)/(R+1)) / ((n+1)/(N+2)) )
• W2 = log( ((r+0.5)/(R+1)) / ((n-r+0.5)/(N-R+1)) )
• W3 = log( ((r+0.5)/(R-r+0.5)) / ((n+1)/(N-n+1)) )
• W4 = log( ((r+0.5)/(R-r+0.5)) / ((n-r+0.5)/(N-n-(R-r)+0.5)) )
Term weights and document weights

Doc weights     W1      W2      W3      W4
D1           -0.079  -0.176  -0.176  -0.477
D2            0.240   0.824   0.699   1.653
D3            0.064   0.347   0.347   0.699

Term weights    W1      W2      W3      W4
GOLD         -0.079  -0.176  -0.176  -0.477
SILVER        0.097   0.301   0.176   0.477
TRUCK         0.143   0.523   0.523   1.176

(A document's weight is the sum of the weights of the query terms it contains; e.g., D2 contains silver and truck, so its W1 = 0.097 + 0.143 = 0.240.)
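A minimal sketch that reproduces the term-weight table (base-10 logs, using the smoothed formulas above):

```python
# Minimal sketch computing the smoothed W1-W4 weights above (illustrative).
import math

def rsj_weights(N, R, n, r):
    w1 = math.log10(((r + 0.5) / (R + 1)) / ((n + 1) / (N + 2)))
    w2 = math.log10(((r + 0.5) / (R + 1)) / ((n - r + 0.5) / (N - R + 1)))
    w3 = math.log10(((r + 0.5) / (R - r + 0.5)) / ((n + 1) / (N - n + 1)))
    w4 = math.log10(((r + 0.5) / (R - r + 0.5))
                    / ((n - r + 0.5) / (N - n - (R - r) + 0.5)))
    return w1, w2, w3, w4

# (n, r) per query term, with N = 3 and R = 2 as in the example:
for term, (n, r) in {"gold": (2, 1), "silver": (1, 1), "truck": (2, 2)}.items():
    print(term, ["%.3f" % w for w in rsj_weights(N=3, R=2, n=n, r=r)])
```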
Language model
• A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings.
• The full set of strings that can be generated is called the language of the automaton.
Language model for a query

[Figure: a simple finite automaton language model, with a starting state and an accepting state, that generates word sequences such as "hot dog restaurant in city".]
Example
Suppose, now, that we have two language models, M1 and M2. Find the probability estimate of the sequence "frog said that toad likes frog", given that the probability of continuing after a word is 0.8 and the probability of stopping is 0.2. [The term-probability tables of M1 and M2 are not reproduced in this transcript.]
Example
• To find the probability of a word sequence, we just multiply the probabilities which the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word.
• P(frog said that toad likes frog) = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
• ≈ 0.000000000001573
• The first parenthesized factor collects the term emission probabilities; the second gives the probability of continuing after each of the first five words and stopping after the last.
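A quick check of this arithmetic (illustrative):

```python
# Quick check of the sequence probability above (illustrative).
emission = [0.01, 0.03, 0.04, 0.01, 0.02, 0.01]  # frog said that toad likes frog
p = 1.0
for e in emission[:-1]:
    p *= e * 0.8          # emit a word, then continue
p *= emission[-1] * 0.2   # emit the last word, then stop
print(p)  # ~1.573e-12
```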
Types of language models
• The simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model:

Puni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

• Such a model places a probability distribution over any sequence of words.
Other types of language models

• There are many more complex kinds of language models, such as the bigram model, which conditions on the previous term:

Pbi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)

• In general, the chain rule decomposes the probability of a sequence of events into the probability of each successive event conditioned on the earlier events:

P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1t2) P(t4|t1t2t3)
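A minimal sketch contrasting the two models (the tiny probability tables are invented for illustration):

```python
# Minimal sketch of unigram vs. bigram sequence probabilities (illustrative;
# the probability tables below are invented for this example).
uni = {"the": 0.2, "frog": 0.01, "likes": 0.02, "toad": 0.01}
bi = {("the", "frog"): 0.05, ("frog", "likes"): 0.1, ("likes", "toad"): 0.04}

def p_unigram(words):
    p = 1.0
    for w in words:
        p *= uni[w]           # P(t1) P(t2) P(t3) ...
    return p

def p_bigram(words):
    p = uni[words[0]]         # P(t1)
    for prev, cur in zip(words, words[1:]):
        p *= bi[(prev, cur)]  # P(t_i | t_{i-1})
    return p

seq = ["the", "frog", "likes", "toad"]
print(p_unigram(seq), p_bigram(seq))
```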
Exercise
• Can you find examples of each type of language model?
• Unigram model
• Bigram model
• Successive model
• Read the following and prepare a report:
• Relevant document retrieval via discrete stochastic optimization
• URL: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6716603&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6716603
Exercise

• An Adaptation of the Vector-Space Model for Ontology-Based Information Retrieval
• URL: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4039288&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4039288