IR Theory - KNU
TRANSCRIPT
IR Theory: IR Basics & Overview of IR Models
IR Approach
Search Engine 2
[Diagram: an information seeker with an information need issues a query string; authors express concepts as documents. The IR system asks: is the document relevant to the query?]
IR System Architecture
[Diagram: a query and the documents each pass through a Representation Module, yielding a query representation and document representations; the Matching/Ranking Module compares them and returns results.]
IR Step 1: Representation
How to represent text? How do we represent the complexities of language?
Computers don’t “understand” documents or queries
Simple, yet effective approach: the “bag of words”. Treat all the words in a document as index terms for that document; disregard order, structure, meaning, etc. of the words.
McDonald's slims down spuds
Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.
NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …
Term counts: 16 × said; 14 × McDonalds; 12 × fat; 11 × fries; 8 × new; 6 × company, french, nutrition; 5 × food, oil, percent, reduce, taste, Tuesday; …
Bag of Words
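A minimal sketch of the bag-of-words idea in Python. The snippet below is an abbreviated paraphrase of the article above, used only for illustration:

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase, split into word tokens, count occurrences;
    order, structure, and meaning are discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Abbreviated snippet of the article above, for illustration only.
snippet = ("McDonald's is cutting the amount of bad fat in its french "
           "fries, the fast-food chain said Tuesday. The company says "
           "the fries won't taste different.")
bow = bag_of_words(snippet)
```

As in the article's count list, function words like "the" dominate raw counts, which motivates the stopword removal and term weighting discussed next.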
Bag-of-Word Representation
Search Engine 6
Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

Stopword list: the, is, for, to, of

Term    Document 1    Document 2
aid     0             1
all     0             1
back    1             0
brown   1             0
come    0             1
dog     1             0
fox     1             0
good    0             1
jump    1             0
lazy    1             0
men     0             1
now     0             1
over    1             0
party   0             1
quick   1             0
their   0             1
time    0             1

Document 1 → 00110110110010100
Document 2 → 11001001001101011
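The binary (presence/absence) representation above can be sketched as follows. Note the slide also stems “jumped” to “jump”; that does not change the term’s alphabetical position here, so the bit strings come out the same without a stemmer:

```python
import re

# Drop stopwords, sort the remaining vocabulary alphabetically,
# and record presence (1) / absence (0) per document.
STOPWORDS = {"the", "is", "for", "to", "of"}

def tokens(text):
    words = re.findall(r"[a-z']+", text.lower())
    # crude normalization: strip the possessive ("dog's" -> "dog")
    return [w.removesuffix("'s") for w in words if w not in STOPWORDS]

d1 = "The quick brown fox jumped over the lazy dog's back."
d2 = "Now is the time for all good men to come to the aid of their party."

t1, t2 = set(tokens(d1)), set(tokens(d2))
vocab = sorted(t1 | t2)
vec1 = "".join("1" if t in t1 else "0" for t in vocab)
vec2 = "".join("1" if t in t2 else "0" for t in vocab)
```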
IR Step 2: Term Weighting
Term Weight: What & How?
What is term weight? A numerical estimate of a term’s importance.

How should we estimate term importance?
• Terms that appear often in a document should get high weights: the more often a document contains the term “dog”, the more likely the document is “about” dogs.
• Terms that appear in many documents should get low weights: words like “the”, “a”, “of” appear in (nearly) all documents.
• Term frequency in long documents should count less than in short ones.

How do we compute it? From term frequency (tf), inverse document frequency (idf), and document length (dl).
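These three ingredients can be combined in many ways. One simple, commonly used variant (a sketch, not necessarily the lecture’s exact formula) multiplies length-normalized tf by idf:

```python
import math
from collections import Counter

# Toy collection: each document is a list of tokens.
docs = [
    "the dog chased the cat".split(),
    "the dog barked".split(),
    "the cat slept".split(),
]
N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(t for d in docs for t in set(d))

def weight(term, doc):
    """tf-idf with document-length normalization:
    high if the term is frequent in the document (tf),
    low if it appears in many documents (idf),
    discounted in long documents (dl)."""
    tf = doc.count(term)          # term frequency
    dl = len(doc)                 # document length
    idf = math.log(N / df[term])  # inverse document frequency
    return (tf / dl) * idf
```

Here `weight("the", docs[0])` is 0 because “the” appears in every document (idf = log 1 = 0), while a rare term like “chased” gets the highest weight.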
IR Step 3: Matching & Ranking
IR Models
Boolean Model ← Boolean Logic + Set Theory
• Query: logical expression of terms (e.g., a AND b); Document: a set of terms
• Search Result → a set of documents satisfying the query expression
Vector Space Model
• Document & query as vectors of terms
• Search Result → documents ranked by query-document similarity
Probabilistic Model
• In practice, similar to VSM except using probabilistic term weights
• Search Result → documents ranked by probability of relevance
Boolean Model: Overview
Weights assigned to terms are either “0” or “1”: “0” represents absence (the term isn’t in the document); “1” represents presence (the term is in the document).
Build queries by combining terms with Boolean operators: AND, OR, NOT.
The system returns all documents that satisfy the query.
[Venn diagrams: two sets A and B, illustrating A OR B, A AND B, and A AND NOT(B)]
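Boolean retrieval maps directly onto set operations over an inverted index. A minimal sketch (the toy documents and terms are invented for illustration):

```python
# Toy collection: document id -> set of terms it contains.
docs = {
    1: {"good", "party", "music"},
    2: {"republican", "party", "platform"},
    3: {"excellent", "party", "food"},
}

# Inverted index: term -> set of ids of documents containing it.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

def posting(term):
    return index.get(term, set())

r_and = posting("party") & posting("good")                           # AND: intersection (narrows)
r_or = posting("party") & (posting("good") | posting("excellent"))   # OR: union (broadens)
r_not = posting("party") - posting("republican")                     # NOT: difference (narrows)
```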
Boolean Model: Strength
Boolean operators define the relationship between query terms.
AND → terms/concepts that are not equivalent/similar
• party AND good: good party
• Retrieves records that include all AND terms → narrows the search
OR → related terms, synonyms
• party AND (good OR excellent OR wild): good party, excellent party, wild party
• Retrieves records that include any OR term → broadens the search
NOT → antonyms, alternate terms for polysemes
• party NOT republican: excludes the “Republican party” sense
• Eliminates records that include the NOT term → narrows the search
Precise, if the searcher knows the right strategies: which concepts to combine or exclude, when to narrow or broaden.
Efficient for the computer.
Boolean Model: Weakness
Natural language is far more complex: Boolean logic is insufficient to capture the richness of language.
AND “discovers” nonexistent relationships
• Terms may occur in different sentences, paragraphs, … → “Money is good, but I won’t be party to stealing.”
Guessing terminology for OR is hard
• good, nice, excellent, outstanding, awesome, …
Guessing terms to exclude (NOT) is even harder!
• Republican party, party to a lawsuit, …
No control over the size of the result set
• Too many documents or none; all documents in the result set are considered “equally good”
No partial matching
• Documents that “don’t quite match” the query may also be useful.
Vector Space Model: Overview
Documents are represented as vectors of terms. The query is also represented as a term vector. Documents are ranked by their similarity to the query.
Similarity = cosine of the angle between document & query vectors.
The vector space is n-dimensional, where n = number of terms in the collection. Terms form orthogonal vectors along the axes ↔ term independence assumption.
Vector Space Model: Representation
“Bags of words” can be represented as vectors:
• Computational efficiency
• Ease of manipulation
• Geometric metaphor: “arrows”
A vector is a set of values recorded in a consistent order.
“The quick brown fox jumped over the lazy dog’s back”

Bag of words:
back 1, brown 1, dog 1, fox 1, jump 1, lazy 1, over 1, quick 1, the 2

Vector: (1, 1, 1, 1, 1, 1, 1, 1, 2)
where the 1st position corresponds to “back”, the 2nd to “brown”, the 3rd to “dog”, the 4th to “fox”, the 5th to “jump”, the 6th to “lazy”, the 7th to “over”, the 8th to “quick”, and the 9th to “the”.
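The bag-to-vector conversion above can be sketched directly: count the words, fix a term order (alphabetical here), and read off the counts.

```python
from collections import Counter

sentence = "The quick brown fox jumped over the lazy dog's back"

# Crude normalization to match the slide's terms: lowercase,
# strip the possessive ("dog's" -> "dog"), stem "jumped" -> "jump".
words = [w.lower().removesuffix("'s") for w in sentence.split()]
words = ["jump" if w == "jumped" else w for w in words]

bag = Counter(words)
vocab = sorted(bag)               # fixed (alphabetical) term order
vector = [bag[t] for t in vocab]  # -> (1, 1, 1, 1, 1, 1, 1, 1, 2)
```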
Vector Space Model: Retrieval
Order documents by “relevance”
• Relevance = how likely they are to be relevant to the information need
• Some documents are “better” than others; users can decide when to stop reading
Best (partial) match
• Documents need not have all query terms
• Documents with more query terms should be “better”
Estimate relevance with query-document similarity:
1. Treat the query as if it were a document: create a query bag-of-words and compute term weights
2. Find its similarity to each document
3. Rank-order the documents by similarity
• Works surprisingly well
Vector Space Model: 3-D Example
[Figure: a vector A in 3-dimensional space, represented with its initial point at the origin of a rectangular coordinate system. The projections of A on the x, y, and z axes (Ax, Ay, Az) are the (rectangular) components of A in the x, y, and z directions; each axis represents a term (e.g., x = all, y = brown, z = cat).]
Vector Space Model: Postulate
Documents that are “close together” in vector space “talk about” the same things.
[Figure: document vectors d1–d5 plotted along term axes t1, t2, t3, with angle θ between two of them.]
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”).
Vector Space Model: Example
Query: What is information retrieval? → q: information 1, retrieval 1

D1: Information retrieval seminars
D2: Retrieval seminars and information retrieval
D3: Information seminar

Index Term        d1   d2   d3
t1 (information)   1    1    1
t2 (retrieval)     1    2    0
t3 (seminar)       1    1    1

[Figure: query vector q and document vectors d1, d2, d3 plotted along term axes t1, t2, t3]

With q = (1, 1, 0):
cos(q, d1) = (1·1 + 1·1 + 0·1) / (√2 · √3) = 2/√6 ≈ 0.82 → θ ≈ 35°
cos(q, d2) = (1·1 + 1·2 + 0·1) / (√2 · √6) = 3/√12 ≈ 0.87 → θ ≈ 30°
cos(q, d3) = (1·1 + 1·0 + 0·1) / (√2 · √2) = 1/2 → θ = 60°

Ranking by similarity: d2, d1, d3.
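The cosine computations in this example can be reproduced in a few lines:

```python
import math

def cos_sim(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# Axes: t1 = information, t2 = retrieval, t3 = seminar
q = (1, 1, 0)
d1 = (1, 1, 1)
d2 = (1, 2, 1)
d3 = (1, 0, 1)

angles = {name: math.degrees(math.acos(cos_sim(q, vec)))
          for name, vec in [("d1", d1), ("d2", d2), ("d3", d3)]}
ranking = sorted(angles, key=angles.get)  # smallest angle first
# d2 at 30°, d1 at ≈35°, d3 at 60° -> ranking: d2, d1, d3
```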
Vector Space Model: Pros & Cons
Pros
• Non-binary term weights
• Partial matching
• Ranked results
• Easy query formulation
• Query expansion
Cons
• Term relationships ignored
• Term order ignored
• No wildcard
• Problematic with long documents
• Similarity ≠ Relevance
Boolean vs. Vector Space Model
Boolean model
• Based on the notion of sets; does not impose a ranking on retrieved documents
• Documents are retrieved only if they satisfy the Boolean conditions specified in the query → exact match
Vector space model
• Based on geometry: the notion of vectors in a high-dimensional space
• Documents are ranked based on their similarity to the query → best/partial match
Probabilistic Model: Retrieval
Probability Ranking Principle (Robertson, 1977)
• Assumption: each document is either relevant or not relevant
• Ranking by the document’s probability of relevance to the query → maximizes overall retrieval effectiveness

Binary Independence Retrieval Model
Rank documents (binary term vectors d) by their relevance odds O(R|d) = P(R|d) / P(R̄|d), which is rank-equivalent to

RSV(d) = Σ_{i=1}^{n} t_i × log [ p_i(1−q_i) / (q_i(1−p_i)) ]

• P(R|d) = probability that document d is relevant
• t_i = ith of n terms (1 if it occurs in d, 0 otherwise)
• p_i = probability that t_i occurs in a relevant document
• q_i = probability that t_i occurs in a non-relevant document

(Derivation in the Appendix.)
Probabilistic Model: Term Weight
Term Relevance Weight: a term weight based on the Binary Independence Retrieval Model → leverages relevance information to weight terms.

tr_k = log [ p_k(1−q_k) / (q_k(1−p_k)) ] = log [ O(p_k) / O(q_k) ] ≅ log [ (N_d − d_k) / d_k ]

• tr_k = Term Relevance Weight of term k
• O(p_k) = odds that term k appears in a relevant document
• O(q_k) = odds that term k appears in a non-relevant document

Probability estimation without relevance information:
• q_k ≈ d_k / N_d (assume #non-relevant docs ≈ collection size)
• p_k ≈ 0.5 (assume random chance)

where
• p_k = probability that term k occurs in a relevant document
• q_k = probability that term k occurs in a non-relevant document
• d_k = number of documents in which term k occurs
• N_d = number of documents in the collection (i.e., collection size)

(Derivation in the Appendix.)
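A small sketch of the term relevance weight under these no-relevance-information estimates, including the smoothed variant q_k = (d_k + 0.5) / (N_d + 1) derived in the Appendix:

```python
import math

def tr(d_k, N_d, smoothed=False):
    """Term relevance weight with p_k = 0.5 (no relevance information):
    the p-terms cancel, leaving tr_k = log((1 - q_k) / q_k)."""
    if smoothed:
        # q_k = (d_k + 0.5) / (N_d + 1)  ->  log((N_d - d_k + 0.5) / (d_k + 0.5))
        return math.log((N_d - d_k + 0.5) / (d_k + 0.5))
    # conventional q_k = d_k / N_d  ->  log((N_d - d_k) / d_k)
    return math.log((N_d - d_k) / d_k)

def idf(d_k, N_d):
    return math.log(N_d / d_k)
```

For a rare term in a large collection, tr(d_k, N_d) is close to idf(d_k, N_d), matching the remark that the weight behaves like idf for large N_d.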
Appendix
Binary Independence Retrieval Model

A document is a binary term vector d = (t_1, …, t_n), where t_i = 1 if term i belongs to d and t_i = 0 otherwise.

Prior odds of relevance:
O(R) = P(R) / P(R̄)

Odds of relevance given d:
O(R|d) = P(R|d) / P(R̄|d)
       = [P(R) P(d|R) / P(d)] / [P(R̄) P(d|R̄) / P(d)]        (apply Bayes’ theorem)
       = O(R) × ∏_{i=1}^{n} [ P(t_i|R) / P(t_i|R̄) ]           (assume term independence)

where
• P(R) = probability that a (randomly selected) document is relevant
• P(d) = probability that document d is observed
• P(R|d) = probability that document d is relevant
• P(d|R) = probability of observing document d from the set of relevant documents R
  = probability that a (randomly selected) document from R is document d

Split the product by presence/absence of terms:
O(R|d) = O(R) × ∏_{t_i=1} [ P(t_i=1|R) / P(t_i=1|R̄) ] × ∏_{t_i=0} [ P(t_i=0|R) / P(t_i=0|R̄) ]

Let p_i = P(t_i=1|R), the probability that t_i occurs in a relevant document, and q_i = P(t_i=1|R̄), the probability that t_i occurs in a non-relevant document. Then

O(R|d) = O(R) × ∏_{t_i=1} (p_i / q_i) × ∏_{t_i=0} [ (1−p_i) / (1−q_i) ]
       = O(R) × ∏_{t_i=1} [ p_i(1−q_i) / (q_i(1−p_i)) ] × ∏_{i=1}^{n} [ (1−p_i) / (1−q_i) ]

(the second step multiplies and divides the t_i=1 product by (1−p_i)/(1−q_i), so the trailing product runs over all n terms).

Ignore O(R) and ∏_{i=1}^{n} (1−p_i)/(1−q_i), which are constant across documents, and take the log, using log(x_1 x_2 … x_n) = log x_1 + log x_2 + … + log x_n = Σ_{i=1}^{n} log x_i:

RSV(d) = Σ_{t_i=1} log [ p_i(1−q_i) / (q_i(1−p_i)) ] = Σ_{i=1}^{n} t_i × log [ p_i(1−q_i) / (q_i(1−p_i)) ]
Term Relevance Weight

From RSV(d) = Σ_{i=1}^{n} t_i × log [ p_i(1−q_i) / (q_i(1−p_i)) ], the weight of term k is the log odds ratio:

tr_k = log [ p_k(1−q_k) / (q_k(1−p_k)) ] = log [ p_k/(1−p_k) ] − log [ q_k/(1−q_k) ] = log [ O(p_k) / O(q_k) ]
     = log (odds that term k appears in a relevant document) − log (odds that term k appears in a non-relevant document)

Probability estimation for the Term Relevance Weight without relevance information:
• Known
  → N_d = total number of documents in the collection (collection size)
  → d_k = number of documents in which term k appears (postings)
• Estimated
  → q_k = probability that term k occurs in a non-relevant document:
     q_k ≈ d_k / N_d (conventional), or q_k = (d_k + 0.5) / (N_d + 1)
     (assume that the number of non-relevant documents ≈ collection size)
  → p_k = probability that term k occurs in a relevant document: p_k ≈ 0.5 (assume random chance)

With p_k = 0.5, the p-terms cancel, since log [ p_k/(1−p_k) ] = log 1 = 0, leaving

tr_k = log [ (1−q_k) / q_k ]

Conventional estimate (q_k = d_k / N_d):
tr_k = log [ (1 − d_k/N_d) / (d_k/N_d) ] = log [ (N_d − d_k) / d_k ]

Smoothed estimate (q_k = (d_k + 0.5) / (N_d + 1)):
tr_k = log [ (1 − (d_k+0.5)/(N_d+1)) / ((d_k+0.5)/(N_d+1)) ] = log [ (N_d − d_k + 0.5) / (d_k + 0.5) ]

• Same as idf for large N_d and small d_k: tr_k ≈ log (N_d / d_k)