chapter 21 information retrieval. 2 what is information retrieval (ir)? “information retrieval...

Chapter 21

Information Retrieval

2

What is Information Retrieval (IR)?

“Information retrieval deals with the representation, storage, and access to (unstructured) documents or

representatives of documents (document surrogates)” (Gerard Salton)

Introduction

“The location and presentation to a user of information relevant to an information need expressed as a query”

(Robert R. Korfhage)

3

Primary goal of an IR system:

“Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant

documents as possible.”

Introduction

4

Introduction Goals of information retrieval (IR):

Retrieve information which are relevant to the user’s need

Represent and organize information for easy access

Translate user information need into queries that can be processed by IR systems

Information Retrieval vs. Data Retrieval: Information Retrieval: deal with natural language text which are

unstructured and semantically ambiguous

Data Retrieval: deals with well-defined and structured data

IR systems rank the content of each document in a collection according to a degree of relevance to the user’s query

(need)

5

Information Retrieval vs. Data Retrieval

Design Issues Data Retrieval Information Retrieval

Matching

Model

Classification Approach

Query language

Query specification

Items wanted

Error response

Data representation

Exact Match Partial (Best) Match

Deterministic Probabilistic

Monotonic Polytechnic

Artificial Natural

Complete Incomplete

Matching Relevant

Sensitive Insensitive

Schema (Mostly) Index terms

6

Information Retrieval Topics to be addressed (in this chapter)

Relevance Ranking Using Terms/Keywords

Relevance Ranking Using Hyperlinks

Synonyms, Homonyms, and Ontologies

Indexing of Documents

Measuring Retrieval Effectiveness

Web Search Engines

Information Retrieval and Structured Data

Directories

7

Information Retrieval Systems Information retrieval (IR) systems use a simpler data model

than database systems

Information organized as a collection of documents

Documents are unstructured, no schema

Information retrieval locates relevant documents, on the basis of user input such as keywords/example documents

e.g., find docs containing the words “database systems”

Can be used even on textual descriptions provided with non-textual data, such as images

Web search engines are the most familiar example of IR systems

8

Information Retrieval Systems (Cont.) Differences from database systems

IR systems don’t deal with transactional updates, concurrency control, data recovery, etc.

IR systems deal with some querying issues not generally addressed by database systems

• Approximate searching by keywords

• Ranking of retrieved answers by estimated degree of relevance

Database systems deal with structured data, with schemas that define the data organization, which does not apply to IR

9

Keyword Search In full text retrieval, all the words (i.e., terms) in each doc are

considered to be keywords.

IR systems typically allow query expressions formed using keywords & the logical connectives and (implicit), or, and not

Ranking of documents on the basis of estimated relevance to a query is critical

Relevance ranking is based on factors such as• Term frequency (TF) of occurrence of query keyword in docs &

Inverse document frequency (IDF) How many documents the query keyword occurs in

– Fewer documents give more importance to the keyword

• Hyperlinks to documents More links to a document the document is more important

10

Relevance Ranking Using Terms TF-IDF (Term Frequency-Inverse Document Frequency)

ranking: Let n(d) = number of terms in document d n(d, t) = number of occurrences of term t in d Relevance of a document d to a term t is determined by

• The log factor is to avoid excessive weight to frequent terms

• log2 (X) > 0, if X > 1

Relevance of document d to query Q

n(d)n(d, t)

1 +TF(d, t) = log2

TF(d, t) = n(d, t) / max(n(d, l)),or where l is a term in document d

or r(d, Q) = TF(d, t) IDF(t), where IDF(t) = log2(N/n(t)), and N is the total number of documents in a collection

t Q

r(d, Q) = TF(d, t)n(t)t Q

, where n(t) = number of documents with term t in a collection of documents

11

Relevance Ranking Using Terms (Cont.) Most systems add to the above model

Words that occur in title, author list, section headings, etc., are given greater importance

Words whose first occurrence is late in the document are given lower importance

Very common words, called stop words, such as “a”, “an”, “the”, “it”, etc., are eliminated

Proximity: if keywords in a query occur close together in the document, the document has higher importance

than if they occur far apart

Documents are returned in decreasing order of relevance score

Usually only top few documents are returned, not all

12

Similarity Based Retrieval Similarity based retrieval - retrieve documents similar to a

given document Similarity may be defined on the basis of common words, e.g., find k

terms in document D with highest TF(d, t)/n(t) and use these terms to find relevance of other documents.

Relevance feedback: Similarity can be used to refine answer set to keyword query

User selects a few relevant documents from those retrieved by keyword query and system finds other similar documents

Vector space model: define an n-dimensional space, where n is the number of words in the document set

Vector for document d goes from origin to a point whose ith coordinate is TF(d, ti)/n(ti)

The cosine of the angle between the vectors of two documents (or a query and a doc) is used as a measure of their similarity

13

The Vector (Space) Model

r(dj, Q) = Sim(q, dj) = cos() = [dj q] / |dj | |q|

= [ti=1 wij wiq ] / ( t

i=1 wij2 t

i=1 wiq2 )

where is the inner product operator, |q| (|dj |, respectively) is the length of q (dj, respectively), wij

= TF(dj, i) IDF(i), and

wiq = 0.5 + [0.5 (TF(q, i) / max(TF(q, l)))] log2(N / n(i)),

where N is the total number of documents and n(i) is the

number of documents with keyword i

Since wij 0 and wiq 0, 1 sim(q, dj ) 0

dj

q

14

d1

d2

d3

d4 d5

d6

d7

k1

k2

k3

The Vector (Space) ModelExample.

d1 2 0 1 5

111

217

5

10

d2 1 0 0

d3 0 1 3

d4 2 0 0

d5 1 2 4

d6 1 2 0

d7 0 5 0

q 1 2 3

Doc k1 k2 k3 q ● dj |q| |dj| |q||dj| Sim(q, dj)

14 5 14 5 0.6

14

14

141414

14

1 14 1

0.2710 14 10 0.93

4 14 2

0.2721 14

21 0.99

5 14 5 0.6

25 14 5

0.54

Sim(q, dj) = (dj q) / (|dj | |q|)= (t

i=1 wij wiq) / ( t

i=1 wij2 t

i=1 wiq2 )

15

Measuring Retrieval Effectiveness IR systems save space by using index structures that

support only approximate retrieval. May result in: False negatives (false drops) - relevant documents that

are not retrieved

False positives - non-relevant documents were retrieved

For many applications a good index should not permit any false drops, but may permit a few false positives; neither one is desirable in an IR system

Relevant performance metrics: Precision - what percentage of the retrieved documents

are relevant to the query, i.e., #Relevant / #Retrieved

Recall - what percentage of the documents relevant to the query were retrieved, i.e., #Retrieved Relevant / #Relevant

16

Precision versus Recall Let R be the set of relevant documents, and let |R| be the number

of documents in R for a given query.

Let A be the document answer set, and let |A| be the number of documents in A.

Let Ra = R A, and |Ra| be the number of documents in Ra.

Recall = , fraction of relevant documents retreived

Precisioin = , fraction of retrieved docs that are relevant

False Positives =

False Negatives =

|Ra||R|

|Ra||A|

CollectionRelevant Docs |R|

Answer Set |A|

Relevant DocsIn Answer Set |Ra|

|A| - |Ra|

|R| - |Ra|

17

Measuring Retrieval Effectiveness (Cont.) Recall vs. precision tradeoff:

Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many non-

relevant documents would be fetched, reducing precision

Measures of retrieval effectiveness, i.e., precision-vs-recall Recall as a function of number of documents fetched, or Precision as a function of recall, i.e., number of docs fetched

Problem: which documents are actually relevant, and which are not; researchers have created collections of (manually

tagged) documents and queries for measures

0

20

4060

80100

20 40 60 80 100

Prec

isio

n

Recall

••

• • •• • • • •

•

30 5010 70 90

1. d123 6. d9 11. d38

2. d84 7. d511 12. d48

3. d56 8. d129 13. d250

4. d6 9. d187 14. d113

5. d8 10. d25 15. d3

Let |R| = 1066%

50%40%

33%

100%

18

Relevance Using Hyperlinks Number of documents relevant to a query can be enormous if

only term frequencies are taken into account (in a large collection of documents, e.g., the Web)

Using term frequencies makes “spamming” easy e.g., a travel agency can add many occurrences of the words

“travel” to its page to make its rank very high

Most of the time people are looking for Web pages from popular sites (using hyperlinks)

Idea: use popularity of Web site (e.g., how many people visit it, or Web pages point to it) to rank site pages that match given keywords

Problem: hard to find actual popularity of site

19

Relevance Using Hyperlinks (Cont.) Solution: use the number of hyperlinks to a site/page as a measure

of the popularity or prestige of the site/page

Refinements

Combine with TF-IDF based measures with the popularity of a page P to compile the overall degree relevance of P

In computing prestige based on links to a site/page, give more weight to links from sites/pages that have higher prestige

• A transfer-of-prestige approach

• Definition is circular, i.e., popularity of a page is defined by the popularity of other pages and there may be cycles of links

• The circular problem can be solved by a system of simultaneous linear equations

Basis of the Google PageRank ranking mechanism

• the popularity of page P is based on the popularity of links to P

20

Relevance Using Hyperlinks (Cont.) Google PageRank ranking mechanism:

Using a random walk (traversal) model

• Step 1. Starts at a random Web page

• Step 2 and onward. Jumps to a randomly chosen

i. Web page with probably , or

ii. an outgoing link from the current Web page with probably (1 - )

Pages pointed to from many Web pages (or from a high Page Rank) are more likely to be visited and have higher PageRank

PageRank, P[j], for each Web page j is defined as

P[j] = ( / N) + (1 - ) (T[i, j] P[i])

where

N

i=1

Step 2(i) Step 2(ii)

0 1,N = number of pages, T[i, j] = 1/Ni

P[i] = 1/N (to begin with) (Ni = number of links out of page i),

21

Relevance Using Hyperlinks (Cont.) Connections to social networking theories that ranked prestige of people

e.g., the president of the U.S.A has a high prestige, since many people know him

Someone known by many prestigious people has high prestige

Hub and authority based ranking A hub is a page that stores links to many related pages (on a topic)

An authority is a page that contains actual information on a topic

A page gets a high hub prestige if it points to many pages with high authority-prestige

A page gets a high authority prestige if it is pointed to by many pages with high hub-prestige

Again, prestige definitions are cyclic, and can be got by solving linear equations

Use authority prestige when ranking answers to a query

22

Synonyms and Homonyms Synonyms

e.g., document: “motorcycle repair”; query: “motorcycle maintenance”• need to realize that “maintenance” and “repair” are synonyms

System can extend query as “motorcycle (repair maintenance)”

Homonyms

e.g., “object” has different meanings as noun/verb

disambiguate meanings (to some extent) from the context

Extending queries automatically using synonyms can be problematic

Synonyms may have other meanings as well

Must understand intended meaning in order to infer synonyms

• Or verify synonyms, e.g., program, with the user

23

Indexing of Documents An inverted index maps each keyword Ki to a set of documents Si

that contain the keyword Documents identified by identifiers

Inverted index may record Keyword locations in document to allow proximity based ranking

Counts of number of occurrences of keyword to compute TF

AND operation: Finds documents that contain all of K1, K2, ..., Kn using intersection, i.e., S1 S2 ... Sn

OR operation: documents that contain at least one of K1, K2, …, Kn using union, i.e., S1 S2 ... Sn

Each Si is kept sorted to allow efficient intersection/union by merging

not can be implemented by merging of sorted lists

24

Concept-Based Querying Approach

For each word, determine the concept it represents from context (e.g., disambiguate each word sometimes

by looking at other surrounding words in the same page)

Use one or more ontologies:• Hierarchical structure showing relationship between concepts• e.g., the ISA relationship that we saw in the E-R model

This approach can be used to standardize terminology in a specific field, but comes with very high

overhead in handing billions of documents; not good for search engines

Ontologies can link multiple languages

Foundation of the Semantic Web (not covered here)

chapter 21 information retrieval. 2 what is information retrieval (ir)? “information retrieval...

Documents

information retrieval

text retrieval

ambiguousdata retrieval

schemainformation retrieval

user information

information need

user of information

information retrievaltopics