IR Models
by Prof. Tarjni Vyas
Introduction

[Diagram: the user's information need is abstracted into a query over index terms; the documents in the docs DB are abstracted into document representations; matching the query against these representations produces a ranked list of docs.]
IR Models

User task:
• Retrieval: ad hoc, filtering
• Browsing

Taxonomy of IR models:
• Classic models: Boolean, vector, probabilistic
• Set theoretic: fuzzy, extended Boolean
• Algebraic: generalized vector, latent semantic indexing, neural networks
• Probabilistic: inference network, belief network
• Structured models: non-overlapping lists, proximal nodes
• Browsing: flat, structure guided, hypertext
Specifying an IR Model
• An IR model is a quadruple [D, Q, F, R(qi, dj)]
• D = representation of documents
• Q = representation of queries
• F = framework for modeling document and query representations and their relationships (a standard language/algebra/implementation type for translation that provides the semantics)
• Evaluation w.r.t. "direct" semantics through benchmarks
• R(qi, dj) = ranking function that associates a real number with a query-document pair
About index terms
• Each document is represented by a set of representative keywords or index terms.
• Index terms are meant to capture the document's main themes or semantics.
• Usually, index terms are nouns, because nouns have meaning by themselves.
• However, search engines assume that all words are index terms (full-text representation).
• E.g., T1 = "conference", T2 = "crime"
• Adjectives, adverbs, conjunctions, etc. are not useful as index terms.
Notations/Conventions
• ki is an index term
• dj is a document
• t is the total number of index terms
• K = (k1, k2, ..., kt) is the set of all index terms
• wij >= 0 is the weight associated with the pair (ki, dj); wij = 0 if the term does not appear in the doc
• vec(dj) = (w1j, w2j, ..., wtj) is the weight vector associated with the document dj
• gi(vec(dj)) = wij is the function which returns the weight associated with the pair (ki, dj)
The Boolean Model
• Simple model based on set theory
• Queries and documents are specified as Boolean expressions, which gives precise semantics
• E.g., q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent, thus wij ∈ {0,1}
Example
• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) (disjunctive normal form)
• vec(qcc) = (1,1,0) (a conjunctive component)
• Similar/matching documents:
• md1 = [ka ka d e] => (1,0,0)
• md2 = [ka kb kc] => (1,1,1)
• Unmatched documents:
• ud1 = [ka kc] => (1,0,1)
• ud2 = [d] => (0,0,0)
Similarity/Matching function
sim(q, dj) = 1 if ∃ vec(qcc) such that vec(qcc) ∈ vec(qdnf) and vec(dj) matches vec(qcc) on all query terms
sim(q, dj) = 0 otherwise
• Requires coercion for accuracy
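As an illustration (my sketch, not from the slides), Boolean matching can be implemented by checking a document's binary vector against the conjunctive components of the query's DNF:

```python
# Minimal sketch of Boolean-model matching (illustrative, not from the slides).
# A document is a binary vector over the index terms (ka, kb, kc); the query
# q = ka AND (kb OR NOT kc) is represented by its disjunctive normal form.

Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}  # conjunctive components of q

def sim(doc_vec):
    """Return 1 if the document's binary vector equals any conjunctive
    component of the query's DNF, else 0."""
    return 1 if tuple(doc_vec) in Q_DNF else 0

docs = {
    "md1": (1, 0, 0),  # contains ka only            -> matches
    "md2": (1, 1, 1),  # contains ka, kb, kc         -> matches
    "ud1": (1, 0, 1),  # contains ka and kc, no kb   -> no match
    "ud2": (0, 0, 0),  # contains none of the terms  -> no match
}
for name, vec in docs.items():
    print(name, sim(vec))
```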
Venn Diagram

[Venn diagram of the term sets Ka, Kb, Kc; the regions (1,1,1), (1,1,0), and (1,0,0) are those satisfying q = ka ∧ (kb ∨ ¬kc).]
Drawbacks of the Boolean model

• The expressive power of Boolean expressions to capture the information need and the document semantics is inadequate
• Retrieval based on a binary decision criterion (with no partial matching) does not adequately reflect our intuitions about relevance
• As a result:
• The answer set contains either too few or too many documents in response to a user query
• There is no ranking of documents
Vector Model

• Task:
• Document collection
• Query specifies information need: free text
• Relevance judgments: depend upon the weighting scheme for all docs
• Word evidence: bag of words (no ordering information)
Vector Space Model
• Represent documents and queries as vectors of term-based features
• Features: tied to the occurrence of terms in the collection
• E.g., Solution 1: binary features, ti = 1 if the term is present, 0 otherwise
• Similarity: the number of terms in common (the dot product)
dj = (t1,j, t2,j, ..., tN,j);  qk = (t1,k, t2,k, ..., tN,k)

sim(qk, dj) = Σi=1..N (ti,j × ti,k)
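A minimal sketch of this binary dot product (illustrative code, not from the slides); it reproduces the q·dj column of Example I on the next slide:

```python
# Minimal sketch of binary-feature vector matching (illustrative).
def sim(query, doc):
    """Dot product of binary term vectors: counts the terms in common."""
    return sum(q * d for q, d in zip(query, doc))

# Terms: (k1, k2, k3); query q = (1, 1, 1) as in Example I below.
q = (1, 1, 1)
docs = [(1, 0, 1), (1, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (1, 1, 0), (0, 1, 0)]
print([sim(q, d) for d in docs])  # -> [2, 1, 2, 1, 3, 2, 1], matching the table
```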
The Vector Model: Example I

     k1  k2  k3  q·dj
d1    1   0   1     2
d2    1   0   0     1
d3    0   1   1     2
d4    1   0   0     1
d5    1   1   1     3
d6    1   1   0     2
d7    0   1   0     1
q     1   1   1

[Figure: the documents d1..d7 and the query plotted as vectors in the k1-k2-k3 space.]
Vector Space Model II
• Problem: not all terms are equally interesting, e.g. "accuracy" vs. "crime"
• Solution: replace binary term features with weights
• Document collection: term-by-document matrix
• View each document as a vector in a multidimensional space; nearby vectors are related
• Normalize for vector length

dj = (w1,j, w2,j, ..., wN,j);  qk = (w1,k, w2,k, ..., wN,k)
Cosine similarity
[Figure: documents d1 and d2 as vectors in the term space (t1, t2, t3), with angle θ between them.]

• The distance between vectors d1 and d2 is captured by the cosine of the angle θ between them.
Queries in the vector space model
Central idea: the query as a vector:
• We regard the query as a short document
• Note that dq is very sparse!
• We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.
sim(dj, dq) = (dj · dq) / (|dj| × |dq|) = Σi=1..n (wi,j × wi,q) / ( sqrt(Σi=1..n wi,j^2) × sqrt(Σi=1..n wi,q^2) )
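A minimal sketch of this cosine formula (illustrative; the example vectors are the d1 and q weights from Example III later in the deck):

```python
# Minimal sketch of cosine similarity between weighted term vectors (illustrative).
import math

def cosine_sim(d, q):
    """Cosine of the angle between d and q: dot product over vector lengths."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([2, 0, 1], [1, 2, 3]))  # d1 and q weights from Example III
```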
Vector Similarity Computation
• Similarity = Dot product
• Normalization: normalize weights in advance, or normalize post-hoc
sim(qk, dj) = qk · dj = Σi=1..N (wi,k × wi,j)

sim(qk, dj) = Σi=1..N (wi,k × wi,j) / ( sqrt(Σi=1..N wi,k^2) × sqrt(Σi=1..N wi,j^2) )

• The normalized form is the cosine of the angle between the two vectors; the denominator involves the lengths of the vectors.
Computation of weights wij and wiq
• How do we compute the weights wij and wiq?
• Quantification of intra-document content (similarity / semantic emphasis): the tf factor, the term frequency within a document
• Quantification of inter-document separation (dissimilarity / significant discriminant): the idf factor, the inverse document frequency
• wij = tf(i,j) * idf(i)
Weighting scheme
• Let:
• N be the total number of docs in the collection
• ni be the number of docs which contain ki
• freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms l which occur within the document dj
• The idf factor is computed as idf(i) = log(N/ni)
• The log makes the values of tf and idf comparable.
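A minimal sketch of this weighting scheme (my code, assuming base-10 logs; the collection is the gold/silver/truck example used later in the deck):

```python
# Minimal sketch of the tf-idf weighting from this slide (illustrative;
# variable names are my own, not from the slides; base-10 log assumed).
import math

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc,
    with w_ij = f(i,j) * log(N / n_i)."""
    N = len(docs)
    n = {}  # n_i: number of docs containing term i
    for doc in docs:
        for term in set(doc):
            n[term] = n.get(term, 0) + 1
    weights = []
    for doc in docs:
        freq = {}
        for term in doc:
            freq[term] = freq.get(term, 0) + 1
        max_freq = max(freq.values())  # max_l freq(l,j)
        weights.append({t: (c / max_freq) * math.log10(N / n[t])
                        for t, c in freq.items()})
    return weights

docs = ["shipment of gold damaged in a fire".split(),
        "delivery of silver arrived in a silver truck".split(),
        "shipment of gold in a truck".split()]
for w in tfidf_weights(docs):
    print(w)
```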
Rules:
• WARNING: in a lot of IR literature, "frequency" is used to mean "count"
• Thus "term frequency" in the IR literature means the number of occurrences of a term in a doc
• It is not divided by document length (which would actually make it a frequency)
Best weighting scheme
• The best term-weighting schemes use weights given by wij = f(i,j) * log(N/ni)
• This strategy is called a tf-idf weighting scheme
• For the query term weights, use wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy for general collections, and it is also simple and fast to compute.
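A companion sketch for the query-term weights under the same assumptions (the document-frequency dictionary is taken from the same three-document collection):

```python
# Minimal sketch of the augmented query-term weight w_iq (illustrative).
import math

def query_weights(query_terms, N, n):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)."""
    freq = {}
    for t in query_terms:
        freq[t] = freq.get(t, 0) + 1
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * c / max_freq) * math.log10(N / n[t])
            for t, c in freq.items()}

# E.g. the query "gold silver truck" over the 3-doc collection above,
# where gold and truck occur in 2 docs and silver in 1:
print(query_weights("gold silver truck".split(), 3,
                    {"gold": 2, "silver": 1, "truck": 2}))
```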
The Vector Model: Example II

     k1  k2  k3  q·dj
d1    1   0   1     4
d2    1   0   0     1
d3    0   1   1     5
d4    1   0   0     1
d5    1   1   1     6
d6    1   1   0     3
d7    0   1   0     2
q     1   2   3

[Figure: the documents d1..d7 and the query plotted as vectors in the k1-k2-k3 space.]
The Vector Model: Example III

     k1  k2  k3  q·dj
d1    2   0   1     5
d2    1   0   0     1
d3    0   1   3    11
d4    2   0   0     2
d5    1   2   4    17
d6    1   2   0     5
d7    0   5   0    10
q     1   2   3

[Figure: the documents d1..d7 and the query plotted as vectors in the k1-k2-k3 space.]
We now consider the query "best auto car insurance" on a fictitious collection with N = 1,000,000 documents, where the document frequencies of auto, best, car, and insurance are respectively 5,000, 50,000, 10,000, and 1,000.

Summing the per-term contributions (one per query term: the product of the query weight and the document weight) gives a net score of 0 + 0 + 0.82 + 2.46 = 3.28.
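As a quick check (not on the original slide), these document frequencies imply the following base-10 idf values:

```python
# Base-10 idf values implied by the stated document frequencies (illustrative check).
import math

N = 1_000_000
df = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}
idf = {t: math.log10(N / n) for t, n in df.items()}
print(idf)  # auto: ~2.3, best: ~1.3, car: 2.0, insurance: 3.0
```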
Example 1 (inverted index)

• Draw the inverted index that would be built for the following document collection.
• Doc 1: new home sales top forecasts
• Doc 2: home sales rise in july
• Doc 3: increase in home sales in july
• Doc 4: july new home sales rise
• Hint: i) arranging, ii) sorting, iii) merging
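A minimal sketch (illustrative) that builds the index by following the hint:

```python
# Minimal sketch of building an inverted index for the four docs above
# (illustrative; follows the hint: arrange (term, docID) pairs, sort, merge).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}
# i) arrange: one (term, docID) pair per token
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]
# ii) sort by term, then docID
pairs.sort()
# iii) merge duplicates into postings lists
index = {}
for term, doc_id in pairs:
    postings = index.setdefault(term, [])
    if not postings or postings[-1] != doc_id:
        postings.append(doc_id)
for term in sorted(index):
    print(term, "->", index[term])
```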
Example 2 (Boolean model)

• Consider these documents:
• Doc 1: breakthrough drug for schizophrenia
• Doc 2: new schizophrenia drug
• Doc 3: new approach for treatment of schizophrenia
• Doc 4: new hopes for schizophrenia patients
• For this document collection, use the Boolean model and show the returned results for the query:
a. schizophrenia AND drug
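A minimal sketch (illustrative) of the AND query as a merge of sorted postings lists; the postings follow from the four documents above:

```python
# Minimal sketch of Boolean AND via postings-list intersection (illustrative).
def intersect(p1, p2):
    """Merge two sorted postings lists, keeping docIDs present in both."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

postings = {"schizophrenia": [1, 2, 3, 4], "drug": [1, 2]}
print(intersect(postings["schizophrenia"], postings["drug"]))  # -> [1, 2]
```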
Example 3 (weighted zone scoring)

• Score according to the zone of the document.
• Consider the query shakespeare in a collection in which each document has three zones: author, title, and body.
• The Boolean score function for a zone takes on the value 1 if the query term shakespeare is present in the zone, and zero otherwise.
• Weighted zone scoring in such a collection requires three weights g1, g2, and g3, respectively corresponding to the author, title, and body zones.
• Suppose we set g1 = 0.2, g2 = 0.3, and g3 = 0.5 (so that the three weights add up to 1); this corresponds to an application in which a match in the author zone is least important to the overall score, the title zone somewhat more important, and the body still more.
• Thus, if the term shakespeare were to appear in the title and body zones but not the author zone of a document, the score of this document would be 0.3 + 0.5 = 0.8.
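A minimal sketch of this scoring rule (illustrative; the sample document is invented):

```python
# Minimal sketch of weighted zone scoring for a one-term query (illustrative).
ZONE_WEIGHTS = {"author": 0.2, "title": 0.3, "body": 0.5}  # g1, g2, g3

def weighted_zone_score(doc, term):
    """Sum of the zone weights over the zones in which the term appears."""
    return sum(g for zone, g in ZONE_WEIGHTS.items()
               if term in doc.get(zone, "").lower().split())

doc = {"author": "anonymous",
       "title": "shakespeare and his plays",
       "body": "a study of shakespeare"}
print(weighted_zone_score(doc, "shakespeare"))  # -> 0.8 (title + body)
```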
Example (vector model)
• Q : “gold silver truck”
• D1 : “shipment of gold damaged in a fire”
• D2 : “delivery of silver arrived in a silver truck”
• D3 : “Shipment of gold in a truck”
• Find the ranking of the documents using the vector space model.
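One possible instantiation (my sketch, using raw tf × idf weights and cosine similarity; the normalized weightings from the earlier slides would work equally well):

```python
# Minimal sketch ranking D1-D3 for Q = "gold silver truck" (illustrative).
import math

docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold in a truck"]
query = "gold silver truck"

def tf(text):
    counts = {}
    for t in text.lower().split():
        counts[t] = counts.get(t, 0) + 1
    return counts

N = len(docs)
doc_tfs = [tf(d) for d in docs]
n = {}  # document frequency per term
for d in doc_tfs:
    for t in d:
        n[t] = n.get(t, 0) + 1
idf = {t: math.log10(N / cnt) for t, cnt in n.items()}

def weights(counts):
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

qw = weights(tf(query))
for i, d in enumerate(doc_tfs, 1):
    print(f"D{i}: {cosine(qw, weights(d)):.3f}")  # D2 ranks highest
```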
Algorithm for computing vector scores
• We now initiate the study of determining the K documents with the highest vector space scores for a query.
• Typically, we seek these top K documents ordered by decreasing score; for instance, many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
• Here we give the basic algorithm for this computation.
Algorithm for computing vector scores
• COSINESCORE(q)
1 float Scores[N] = 0
2 Initialize Length[N]
3 for each query term t
4 do calculate wt,q and fetch postings list for t
5 for each pair(d, tft,d) in postings list
6 do Scores[d] += wft,d × wt,q
7 Read the array Length[d]
8 for each d
9 do Scores[d] = Scores[d]/Length[d]
10 return Top K components of Scores[]
• The array Length holds the length (normalization factor) for each of the N documents
• The array Scores holds the score for each of the documents
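A minimal Python rendering of COSINESCORE (illustrative; the postings format and the exact weighting are assumptions, not fixed by the pseudocode):

```python
# Minimal sketch of the COSINESCORE pseudocode above (illustrative).
import heapq

def cosine_score(query_weights, postings, lengths, k=10):
    """query_weights: {term: w_tq}; postings: {term: [(doc_id, wf_td), ...]};
    lengths: {doc_id: vector length}. Returns the top-k (score, doc_id) pairs."""
    scores = {}
    for term, w_tq in query_weights.items():          # for each query term t
        for doc_id, wf_td in postings.get(term, []):  # traverse its postings
            scores[doc_id] = scores.get(doc_id, 0.0) + wf_td * w_tq
    for doc_id in scores:                             # length-normalize
        scores[doc_id] /= lengths[doc_id]
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))
```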
Advantages and disadvantages
• Advantages:
• Term weighting improves answer-set quality
• Partial matching allows retrieval of docs that approximate the query conditions
• The cosine ranking formula sorts documents according to their degree of similarity to the query
• Disadvantages:
• Assumes independence of index terms (though it is not clear that this is a bad assumption)
Why use probabilities ?
• Information retrieval deals with uncertain information
• Probability theory seems to be the most natural way to quantify uncertainty
Goal

• Collection of documents
• User issues a query
• A set of documents needs to be returned
• Question: in what order should we present documents to the user?
• Intuitively, we want the "best" document to be first, the second best second, etc.
• We need a formal way to judge the "goodness" of documents w.r.t. queries.
• Idea: the probability of relevance of the document w.r.t. the query
Probability Ranking Principle
If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request ...

... where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose ...

... then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.

W.S. Cooper
Probability theory
• For two events A and B, the joint event of both events occurring is described by the joint probability P(A, B).
• The conditional probability P(A|B) expresses the probability of event A given that event B occurred.
• The fundamental relationship between joint and conditional probabilities is given by the chain rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A)
Let us remember Probability Theory
Let a, b be two events.
p(a|b) p(b) = p(a, b) = p(b|a) p(a)

p(a|b) = p(b|a) p(a) / p(b)    (Bayes' rule)
Probability Ranking Principle
Let x be a document in the collection.
Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance.

p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)

• p(x|R), p(x|NR): the probability that if a relevant (non-relevant) document is retrieved, it is x
• p(R), p(NR): the prior probability of retrieving a (non-)relevant document
• We need to find p(R|x): the probability that a retrieved document x is relevant.
Probability Ranking Principle
p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)

Ranking Principle (Bayes' Decision Rule):
If p(R|x) > p(NR|x), then x is relevant; otherwise x is not relevant.
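A toy illustration of the decision rule (all numbers invented); since p(x) is common to both posteriors, only the numerators need to be compared:

```python
# Minimal sketch of the Bayes decision rule for relevance (invented numbers).
def is_relevant(p_x_given_R, p_x_given_NR, p_R=0.05, p_NR=0.95):
    """Compare p(R|x) and p(NR|x); p(x) cancels, so compare the numerators."""
    return p_x_given_R * p_R > p_x_given_NR * p_NR

print(is_relevant(p_x_given_R=0.4, p_x_given_NR=0.01))  # True: 0.02 > 0.0095
```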
Independence assumptions about relevant and non-relevant documents

• I1: the distribution of terms in relevant documents is independent, and their distribution in all documents is independent.
• The presence of one term in a document does not assure the presence of another term; they are independent and random.
• I2: the distribution of terms in relevant documents is independent, and their distribution in non-relevant documents is independent.
• Satisfies I1.
• Query = "A B C": the presence of A does not assure the presence or absence of B in a document.
Methods and assumptions
• O1: probable relevance is based on the presence of search terms in the document.
• This says that a document should be ranked only when some query terms match in the document; evidence of relevance must be found.
• O2: probable relevance is based on both the presence of search terms in the document and their absence from the document.
• O2 does not mean that we know nothing; it means that we also have some evidence for non-relevance.
• Together, O1 and O2 mean that we should consider both the presence and the absence of all search terms in the query.
Combination of the methods using probability
• N = number of documents in the collection
• R = number of relevant documents for a given query q
• n = number of documents indexed by a given term t
• r = number of relevant documents indexed by the given term t
• Choosing I1 and O1 gives the weight W1 = log( (r/R) / (n/N) )
• r/R is the relevant-document ratio for the term; n/N depicts the overall frequency of the term in the collection, and if it is too high it decreases the overall weight W1.
Combining the methods

• I2 and O1: W2 = log( (r/R) / ((n-r)/(N-R)) )
• (n-r) = (docs indexed by the term) - (relevant docs indexed by the term) = non-relevant docs indexed by the term
• (N-R) = (total docs) - (relevant docs) = total non-relevant docs for the query
• I1 and O2: W3 = log( (r/(R-r)) / (n/(N-n)) )
• (N-n) = (total docs) - (docs indexed by the term) = docs not indexed by the term (better high)
• (R-r) = (relevant docs for the query) - (relevant docs indexed by the term) (better low)
• I2 and O2: W4 = log( (r/(R-r)) / ((n-r)/((N-n)-(R-r))) )
Example for weight calculation
• Q : “gold silver truck”
• D1 : “shipment of gold damaged in a fire”
• D2 : “delivery of silver arrived in a silver truck”
• D3 : “Shipment of gold in a truck”
• Use the probabilistic model to find the appropriate ranking of the documents.
• Since we are using this procedure in a predictive manner, Robertson and Spärck Jones recommended adding constants to each quantity:
• Add 0.5 to r, 1 to R, 1 to n, and 2 to N, and then calculate W1.
Example: apply the formulas to find W1, W2, W3, and W4 for each term and document.

       gold  silver  truck
N        3      3      3
n        2      1      2
R        2      2      2
r        1      1      2

• N = number of documents in the collection
• R = number of relevant documents for a given query q
• n = number of documents indexed by a given term t
• r = number of relevant documents indexed by the given term t
Modified weight formulas
• W1 = log( ((r+0.5)/(R+1)) / ((n+1)/(N+2)) )
• W2 = log( ((r+0.5)/(R+1)) / ((n-r+0.5)/(N-R+1)) )
• W3 = log( ((r+0.5)/(R-r+0.5)) / ((n+1)/(N-n+1)) )
• W4 = log( ((r+0.5)/(R-r+0.5)) / ((n-r+0.5)/(N-n-(R-r)+0.5)) )
Term weights and document weights

Doc weights     W1      W2      W3      W4
D1           -0.079  -0.176  -0.176  -0.477
D2            0.240   0.824   0.699   1.653
D3            0.064   0.347   0.347   0.699

Term weights    W1      W2      W3      W4
GOLD         -0.079  -0.176  -0.176  -0.477
SILVER        0.097   0.301   0.176   0.477
TRUCK         0.143   0.523   0.523   1.176

(A document's weight is the sum of the weights of the query terms it contains; e.g., D2 contains silver and truck, so its W1 = 0.097 + 0.143 = 0.240.)
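A minimal sketch that reproduces the term-weight table (base-10 logs, using the smoothed formulas above):

```python
# Minimal sketch computing the smoothed W1-W4 weights above (illustrative).
import math

def rsj_weights(N, R, n, r):
    w1 = math.log10(((r + 0.5) / (R + 1)) / ((n + 1) / (N + 2)))
    w2 = math.log10(((r + 0.5) / (R + 1)) / ((n - r + 0.5) / (N - R + 1)))
    w3 = math.log10(((r + 0.5) / (R - r + 0.5)) / ((n + 1) / (N - n + 1)))
    w4 = math.log10(((r + 0.5) / (R - r + 0.5))
                    / ((n - r + 0.5) / (N - n - (R - r) + 0.5)))
    return w1, w2, w3, w4

# (n, r) per query term, with N = 3 and R = 2 as in the example:
for term, (n, r) in {"gold": (2, 1), "silver": (1, 1), "truck": (2, 2)}.items():
    print(term, ["%.3f" % w for w in rsj_weights(N=3, R=2, n=n, r=r)])
```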
Language model
• A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings.
• The full set of strings that can be generated is called the language of the automaton.
Language model for a query

[Figure: a simple finite automaton language model, with a starting state and an accepting state, that generates word sequences such as "hot dog restaurant in city".]
Example
Suppose, now, that we have two language models, M1 and M2. Find the probability estimate of the sequence "frog said that toad likes frog", given that the probability of continuing after a word is 0.8 and the probability of stopping is 0.2. [The term-probability tables of M1 and M2 are not reproduced in this transcript.]
Example
• To find the probability of a word sequence, we just multiply the probabilities which the model gives to each word in the sequence, together with the probability of continuing or stopping after producing each word.
• P(frog said that toad likes frog) = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)
• ≈ 0.000000000001573
• The first parenthesized factor collects the term emission probabilities; the second gives the probability of continuing after each of the first five words and stopping after the last.
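A quick check of this arithmetic (illustrative):

```python
# Quick check of the sequence probability above (illustrative).
emission = [0.01, 0.03, 0.04, 0.01, 0.02, 0.01]  # frog said that toad likes frog
p = 1.0
for e in emission[:-1]:
    p *= e * 0.8          # emit a word, then continue
p *= emission[-1] * 0.2   # emit the last word, then stop
print(p)  # ~1.573e-12
```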
Types of language models
• The simplest form of language model throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model:

Puni(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)

• Such a model places a probability distribution over any sequence of words.
Other types of language models

• There are many more complex kinds of language models, such as the bigram model, which conditions on the previous term:

Pbi(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t2) P(t4|t3)

• In general, the chain rule decomposes the probability of a sequence of events into the probability of each successive event conditioned on the earlier events:

P(t1 t2 t3 t4) = P(t1) P(t2|t1) P(t3|t1t2) P(t4|t1t2t3)
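A minimal sketch contrasting the two models (the tiny probability tables are invented for illustration):

```python
# Minimal sketch of unigram vs. bigram sequence probabilities (illustrative;
# the probability tables below are invented for this example).
uni = {"the": 0.2, "frog": 0.01, "likes": 0.02, "toad": 0.01}
bi = {("the", "frog"): 0.05, ("frog", "likes"): 0.1, ("likes", "toad"): 0.04}

def p_unigram(words):
    p = 1.0
    for w in words:
        p *= uni[w]           # P(t1) P(t2) P(t3) ...
    return p

def p_bigram(words):
    p = uni[words[0]]         # P(t1)
    for prev, cur in zip(words, words[1:]):
        p *= bi[(prev, cur)]  # P(t_i | t_{i-1})
    return p

seq = ["the", "frog", "likes", "toad"]
print(p_unigram(seq), p_bigram(seq))
```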
Exercise
• Can you find examples of each type of language model?
• Unigram model
• Bigram model
• Successive model
• Read the following and prepare a report:
• Relevant document retrieval via discrete stochastic optimization
• URL: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6716603&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6716603
Exercise

• An Adaptation of the Vector-Space Model for Ontology-Based Information Retrieval
• URL: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4039288&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4039288