Chapter 21
Information Retrieval
2
What is Information Retrieval (IR)?
“Information retrieval deals with the representation, storage, and access to (unstructured) documents or
representatives of documents (document surrogates)” (Gerard Salton)
Introduction
“The location and presentation to a user of information relevant to an information need expressed as a query”
(Robert R. Korfhage)
3
Primary goal of an IR system:
“Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant
documents as possible.”
Introduction
4
Introduction
Goals of information retrieval (IR):
Retrieve information that is relevant to the user’s need
Represent and organize information for easy access
Translate the user’s information need into queries that can be processed by IR systems
Information Retrieval vs. Data Retrieval:
Information retrieval deals with natural language text, which is unstructured and semantically ambiguous
Data retrieval deals with well-defined, structured data
IR systems rank each document in a collection according to its degree of relevance to the user’s query (need)
5
Information Retrieval vs. Data Retrieval
Design Issue              Data Retrieval    Information Retrieval
Matching                  Exact Match       Partial (Best) Match
Model                     Deterministic     Probabilistic
Classification approach   Monothetic        Polythetic
Query language            Artificial        Natural
Query specification       Complete          Incomplete
Items wanted              Matching          Relevant
Error response            Sensitive         Insensitive
Data representation       Schema            (Mostly) Index terms
6
Information Retrieval Topics to be addressed (in this chapter)
Relevance Ranking Using Terms/Keywords
Relevance Ranking Using Hyperlinks
Synonyms, Homonyms, and Ontologies
Indexing of Documents
Measuring Retrieval Effectiveness
Web Search Engines
Information Retrieval and Structured Data
Directories
7
Information Retrieval Systems
Information retrieval (IR) systems use a simpler data model than database systems
Information organized as a collection of documents
Documents are unstructured, no schema
Information retrieval locates relevant documents, on the basis of user input such as keywords/example documents
e.g., find docs containing the words “database systems”
Can be used even on textual descriptions provided with non-textual data, such as images
Web search engines are the most familiar example of IR systems
8
Information Retrieval Systems (Cont.)
Differences from database systems
IR systems don’t deal with transactional updates, concurrency control, data recovery, etc.
IR systems deal with some querying issues not generally addressed by database systems
• Approximate searching by keywords
• Ranking of retrieved answers by estimated degree of relevance
Database systems deal with structured data, with schemas that define the data organization, which does not apply to IR
9
Keyword Search
In full text retrieval, all the words (i.e., terms) in each document are considered to be keywords.
IR systems typically allow query expressions formed using keywords and the logical connectives and (implicit), or, and not
Ranking of documents on the basis of estimated relevance to a query is critical
Relevance ranking is based on factors such as
• Term frequency (TF): frequency of occurrence of the query keyword in a document
• Inverse document frequency (IDF): based on how many documents the query keyword occurs in
– The fewer documents a keyword occurs in, the more importance is given to the keyword
• Hyperlinks to documents: the more links point to a document, the more important the document is
10
Relevance Ranking Using Terms
TF-IDF (Term Frequency-Inverse Document Frequency) ranking:
Let n(d) = number of terms in document d
n(d, t) = number of occurrences of term t in d
Relevance of a document d to a term t is determined by
$$\mathrm{TF}(d, t) = \log_2\left(1 + \frac{n(d, t)}{n(d)}\right)$$
or
$$\mathrm{TF}(d, t) = \frac{n(d, t)}{\max_{l} n(d, l)}, \text{ where } l \text{ is a term in document } d$$
• The log factor is to avoid excessive weight to frequent terms
• $\log_2(X) > 0$ if $X > 1$
Relevance of document d to query Q
$$r(d, Q) = \sum_{t \in Q} \frac{\mathrm{TF}(d, t)}{n(t)}$$
where n(t) = number of documents with term t in a collection of documents, or
$$r(d, Q) = \sum_{t \in Q} \mathrm{TF}(d, t) \times \mathrm{IDF}(t)$$
where $\mathrm{IDF}(t) = \log_2(N / n(t))$, and N is the total number of documents in a collection
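A minimal Python sketch of the TF-IDF ranking on this slide, assuming the log-scaled TF variant and the TF × IDF form of r(d, Q). The three toy documents, the query, and the function names are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter

def tf(doc_terms, term):
    """TF(d, t) = log2(1 + n(d, t) / n(d)); 0 for an empty document."""
    if not doc_terms:
        return 0.0
    return math.log2(1 + Counter(doc_terms)[term] / len(doc_terms))

def idf(docs, term):
    """IDF(t) = log2(N / n(t)); 0 if no document contains the term."""
    n_t = sum(1 for d in docs if term in d)
    return math.log2(len(docs) / n_t) if n_t else 0.0

def relevance(doc_terms, query_terms, docs):
    """r(d, Q) = sum over t in Q of TF(d, t) * IDF(t)."""
    return sum(tf(doc_terms, t) * idf(docs, t) for t in query_terms)

# Hypothetical toy collection (each document is a list of terms).
docs = [
    "database systems store structured data".split(),
    "information retrieval systems rank documents".split(),
    "retrieval of documents".split(),
]
query = ["retrieval", "documents"]

# Document indices in decreasing order of relevance to the query.
ranked = sorted(range(len(docs)),
                key=lambda i: relevance(docs[i], query, docs),
                reverse=True)
print(ranked)
```

The shortest matching document ranks first here because n(d) in the denominator favors documents in which the query terms make up a larger share of the text.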
11
Relevance Ranking Using Terms (Cont.)
Most systems add to the above model
Words that occur in title, author list, section headings, etc., are given greater importance
Words whose first occurrence is late in the document are given lower importance
Very common words, called stop words, such as “a”, “an”, “the”, “it”, etc., are eliminated
Proximity: if keywords in a query occur close together in the document, the document has higher importance
than if they occur far apart
Documents are returned in decreasing order of relevance score
Usually only top few documents are returned, not all
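Two of the refinements above, stop-word elimination and returning only the top few documents, could be sketched in Python as follows. The stop-word set reuses the four examples from the slide; the scores and the helper names are hypothetical.

```python
import heapq

# Stop words taken from the slide's examples; real systems use longer lists.
STOP_WORDS = {"a", "an", "the", "it"}

def remove_stop_words(terms):
    """Drop very common words before indexing or querying."""
    return [t for t in terms if t.lower() not in STOP_WORDS]

def top_k(scores, k):
    """Return the k highest-scoring (doc, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Hypothetical relevance scores for three documents.
scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
print(remove_stop_words("the design of a database".split()))
print(top_k(scores, 2))
```

Using `heapq.nlargest` avoids sorting the whole result set when only a few documents are shown to the user.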
12
Similarity Based Retrieval
Similarity based retrieval: retrieve documents similar to a given document
Similarity may be defined on the basis of common words, e.g., find the k terms in document d with the highest TF(d, t)/n(t) and use these terms to find the relevance of other documents.
Relevance feedback: similarity can be used to refine the answer set of a keyword query
The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other similar documents
Vector space model: define an n-dimensional space, where n is the number of words in the document set
The vector for document d goes from the origin to a point whose ith coordinate is TF(d, ti)/n(ti)
The cosine of the angle between the vectors of two documents (or a query and a document) is used as a measure of their similarity
13
The Vector (Space) Model
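$$r(d_j, Q) = \mathrm{Sim}(q, d_j) = \cos(\theta) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\; \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$
where $\cdot$ is the inner product operator, $|q|$ ($|d_j|$, respectively) is the length of $q$ ($d_j$, respectively),
$$w_{ij} = \mathrm{TF}(d_j, i) \times \mathrm{IDF}(i)$$
and
$$w_{iq} = \left(0.5 + 0.5 \cdot \frac{\mathrm{TF}(q, i)}{\max_{l} \mathrm{TF}(q, l)}\right) \times \log_2\frac{N}{n(i)}$$
where N is the total number of documents and n(i) is the number of documents with keyword i
Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le \mathrm{Sim}(q, d_j) \le 1$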
[Figure: the document vector dj and the query vector q, separated by the angle θ]
14
The Vector (Space) Model
Example.
[Figure: documents d1 to d7 placed in the space spanned by the keywords k1, k2, k3]
Doc   k1   k2   k3   q · dj   |q|    |dj|   |q||dj|     Sim(q, dj)
d1    2    0    1    5        √14    √5     √14 · √5    0.6
d2    1    0    0    1        √14    1      √14 · 1     0.27
d3    0    1    3    11       √14    √10    √14 · √10   0.93
d4    2    0    0    2        √14    2      √14 · 2     0.27
d5    1    2    4    17       √14    √21    √14 · √21   0.99
d6    1    2    0    5        √14    √5     √14 · √5    0.6
d7    0    5    0    10       √14    5      √14 · 5     0.53
q     1    2    3
$$\mathrm{Sim}(q, d_j) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\; \sqrt{\sum_{i=1}^{t} w_{iq}^2}}$$
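The example can be checked with a few lines of Python. This sketch assumes the k1, k2, k3 entries of the table are used directly as the weights w_ij and w_iq; the function name `cosine` is illustrative.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two weight vectors; 0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Query and document vectors over the keywords (k1, k2, k3) from the example.
q = (1, 2, 3)
docs = {
    "d1": (2, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 3),
    "d4": (2, 0, 0),
    "d5": (1, 2, 4),
    "d6": (1, 2, 0),
    "d7": (0, 5, 0),
}

# Similarities rounded to two decimals, then documents in decreasing order.
sims = {name: round(cosine(q, vec), 2) for name, vec in docs.items()}
ranking = sorted(sims, key=sims.get, reverse=True)
print(sims)
print(ranking)
```

Note that d2 and d4 get the same score even though d4 has twice the weight on k1: the cosine measure depends only on the direction of a vector, not its length.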
15
Measuring Retrieval Effectiveness
IR systems save space by using index structures that support only approximate retrieval. This may result in:
False negatives (false drops): relevant documents that are not retrieved
False positives: non-relevant documents that are retrieved
For many applications a good index should not permit any false drops, but may permit a few false positives; neither one is desirable in an IR system
Relevant performance metrics:
Precision: what percentage of the retrieved documents are relevant to the query, i.e., #Retrieved Relevant / #Retrieved
Recall: what percentage of the documents relevant to the query were retrieved, i.e., #Retrieved Relevant / #Relevant
16
Precision versus Recall
Let R be the set of relevant documents, and let |R| be the number of documents in R for a given query.
Let A be the document answer set, and let |A| be the number of documents in A.
Let Ra = R ∩ A, and |Ra| be the number of documents in Ra.
Recall = $|R_a| / |R|$, the fraction of relevant documents retrieved
Precision = $|R_a| / |A|$, the fraction of retrieved documents that are relevant
False Positives = $|A| - |R_a|$
False Negatives = $|R| - |R_a|$
[Figure: Venn diagram of the collection showing the relevant documents |R|, the answer set |A|, and their overlap, the relevant documents in the answer set |Ra|]
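These four quantities map directly onto set operations. A small Python sketch, where the document ids in R and A are made up for illustration:

```python
def evaluate(relevant, answer):
    """Compute recall, precision, and error counts from the sets R and A."""
    hits = relevant & answer  # Ra = R ∩ A
    return {
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "precision": len(hits) / len(answer) if answer else 0.0,
        "false_positives": len(answer - relevant),   # |A| - |Ra|
        "false_negatives": len(relevant - answer),   # |R| - |Ra|
    }

# Hypothetical query: four relevant documents, three retrieved.
R = {"d1", "d2", "d3", "d4"}
A = {"d2", "d3", "d7"}
metrics = evaluate(R, A)
print(metrics)
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were found (recall 1/2).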
17
Measuring Retrieval Effectiveness (Cont.)
Recall vs. precision tradeoff:
Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many non-relevant documents would be fetched, reducing precision
Measures of retrieval effectiveness, i.e., precision vs. recall:
Recall as a function of the number of documents fetched, or
Precision as a function of recall (equivalently, of the number of documents fetched)
Problem: determining which documents are actually relevant and which are not; researchers have created collections of (manually tagged) documents and queries for such measurements
[Figure: precision (%) plotted against recall (%), both axes running from 0 to 100, for the ranked answer set below]
Ranked answer set:
1. d123    6. d9      11. d38
2. d84     7. d511    12. d48
3. d56     8. d129    13. d250
4. d6      9. d187    14. d113
5. d8      10. d25    15. d3
Let |R| = 10
Precision values marked on the plot: 100%, 66%, 50%, 40%, 33%
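The points of such a precision-versus-recall curve can be computed by walking down the ranked list. In the sketch below, which documents count as relevant is an assumption: the set places relevant documents at ranks 1, 3, 6, 10, and 15, a placement consistent with the precision values 100%, 66%, 50%, 40%, and 33% and with |R| = 10.

```python
def precision_recall_points(ranking, relevant, total_relevant):
    """Return (recall, precision) each time a relevant document is reached."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]

# Assumption: the relevant documents that appear in the ranking.
relevant_in_list = {"d123", "d56", "d9", "d25", "d3"}
total_relevant = 10  # |R| = 10; the other five relevant documents are never retrieved

points = precision_recall_points(ranking, relevant_in_list, total_relevant)
for recall, precision in points:
    print(f"recall {recall:.0%}  precision {precision:.0%}")
```

Under this assumption recall never exceeds 50%, because only five of the ten relevant documents appear anywhere in the answer set.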
18
Relevance Using Hyperlinks
The number of documents relevant to a query can be enormous if only term frequencies are taken into account (in a large collection of documents, e.g., the Web)
Using term frequencies makes “spamming” easy
e.g., a travel agency can add many occurrences of the word “travel” to its page to make its rank very high
Most of the time people are looking for Web pages from popular sites (using hyperlinks)
Idea: use the popularity of a Web site (e.g., how many people visit it, or how many Web pages point to it) to rank site pages that match given keywords
Problem: hard to find the actual popularity of a site
19
Relevance Using Hyperlinks (Cont.)
Solution: use the number of hyperlinks to a site/page as a measure of the popularity or prestige of the site/page
Refinements
Combine TF-IDF based measures with the popularity of a page P to compute the overall degree of relevance of P
In computing prestige based on links to a site/page, give more weight to links from sites/pages that have higher prestige
• A transfer-of-prestige approach
• Definition is circular, i.e., the popularity of a page is defined by the popularity of other pages, and there may be cycles of links
• The circularity can be resolved by solving a system of simultaneous linear equations
Basis of the Google PageRank ranking mechanism
• the popularity of page P is based on the popularity of the pages that link to P
20
Relevance Using Hyperlinks (Cont.)
Google PageRank ranking mechanism:
Using a random walk (traversal) model
• Step 1. Start at a random Web page
• Step 2 and onward. Jump to
i. a randomly chosen Web page, with probability δ, or
ii. a randomly chosen outgoing link from the current Web page, with probability (1 - δ)
Pages pointed to from many Web pages (or from pages with a high PageRank) are more likely to be visited and have higher PageRank
PageRank, P[j], for each Web page j is defined as
$$P[j] = \frac{\delta}{N} + (1 - \delta) \sum_{i=1}^{N} \left(T[i, j] \times P[i]\right)$$
where the first term corresponds to Step 2(i) and the second term to Step 2(ii), δ is a constant between 0 and 1, N = number of pages, T[i, j] = 1/Ni if page i links to page j and 0 otherwise (Ni = number of links out of page i), and P[i] = 1/N (to begin with)
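One way to evaluate this definition is to start from P[i] = 1/N and apply the formula repeatedly until the values settle. The Python sketch below does that; it is an iterative approximation rather than a direct solution of the linear equations. The value delta = 0.15, the iteration count, the uniform treatment of pages without outgoing links, and the four-page graph are all assumptions made for illustration.

```python
def pagerank(links, delta=0.15, iterations=50):
    """Iterate P[j] = delta/N + (1 - delta) * sum_i T[i, j] * P[i].

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # P[i] = 1/N to begin with
    for _ in range(iterations):
        new_rank = {p: delta / n for p in pages}  # Step 2(i): random jump
        for i in pages:
            out = links[i]
            if out:
                # Step 2(ii): T[i, j] = 1/Ni for each outgoing link
                share = (1 - delta) * rank[i] / len(out)
                for j in out:
                    new_rank[j] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages.
                for j in pages:
                    new_rank[j] += (1 - delta) * rank[i] / n
        rank = new_rank
    return rank

# Hypothetical four-page Web graph.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest value because three pages link to it, and A comes second because it receives all of C's prestige through C's single outgoing link.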
21
Relevance Using Hyperlinks (Cont.)
Connections to social networking theories that rank the prestige of people
e.g., the president of the U.S.A. has high prestige, since many people know him
Someone known by many prestigious people has high prestige
Hub and authority based ranking
A hub is a page that stores links to many related pages (on a topic)
An authority is a page that contains actual information on a topic
A page gets a high hub prestige if it points to many pages with high authority prestige
A page gets a high authority prestige if it is pointed to by many pages with high hub prestige
Again, prestige definitions are cyclic, and can be obtained by solving linear equations
Use authority prestige when ranking answers to a query
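The mutual definition of hub and authority prestige can be approximated by alternating the two updates and normalizing after each round. The Python sketch below takes that iterative route instead of solving the linear equations directly; the link graph, the iteration count, and the function names are assumptions for illustration.

```python
import math

def _normalize(scores):
    """Scale scores to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}

def hubs_and_authorities(links, iterations=20):
    """Return (hub, authority) prestige for a graph given as page -> targets."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority prestige: sum of hub prestige of pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub prestige: sum of authority prestige of pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        auth = _normalize(auth)
        hub = _normalize(hub)
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two content pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get))
print(sorted(hub, key=hub.get, reverse=True))
```

Answers to a query would then be ordered by the `auth` values, as the last bullet above suggests.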
22
Synonyms and Homonyms
Synonyms
e.g., document: “motorcycle repair”; query: “motorcycle maintenance”
• need to realize that “maintenance” and “repair” are synonyms
System can extend query as “motorcycle and (repair or maintenance)”
Homonyms
e.g., “object” has different meanings as noun/verb
disambiguate meanings (to some extent) from the context
Extending queries automatically using synonyms can be problematic
Synonyms may have other meanings as well
Must understand intended meaning in order to infer synonyms
• Or verify synonyms, e.g., program, with the user
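Synonym-based query extension can be sketched as turning each query term into a group of alternatives, then requiring every group to match. The synonym table below is a hypothetical stand-in for a real thesaurus, and it does nothing to resolve the ambiguity problems listed above.

```python
# Hypothetical synonym table; a real system would consult a thesaurus
# and check that the synonym fits the intended meaning.
SYNONYMS = {
    "repair": {"maintenance"},
    "maintenance": {"repair"},
}

def expand_query(terms, synonyms=SYNONYMS):
    """Turn each term into an OR-group of the term and its synonyms."""
    return [{t} | synonyms.get(t, set()) for t in terms]

def matches(doc_text, expanded):
    """A document matches if it contains a word from every OR-group (AND of ORs)."""
    words = set(doc_text.lower().split())
    return all(words & group for group in expanded)

expanded = expand_query(["motorcycle", "maintenance"])
print(matches("motorcycle repair", expanded))
print(matches("motorcycle racing", expanded))
```

The first document matches only because of the expansion, since it never mentions “maintenance”.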
23
Indexing of Documents
An inverted index maps each keyword Ki to a set of documents Si that contain the keyword
Documents are identified by identifiers
Inverted index may record
Keyword locations in document, to allow proximity based ranking
Counts of number of occurrences of keyword, to compute TF
AND operation: finds documents that contain all of K1, K2, ..., Kn using intersection, i.e., S1 ∩ S2 ∩ ... ∩ Sn
OR operation: finds documents that contain at least one of K1, K2, ..., Kn using union, i.e., S1 ∪ S2 ∪ ... ∪ Sn
Each Si is kept sorted to allow efficient intersection/union by merging
The not operation can also be implemented by merging of sorted lists
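A compact Python sketch of an inverted index with sorted document-id lists, and of the and, or, and not operations done by merging. The three sample documents and all function names are illustrative assumptions; keyword positions and counts are left out for brevity.

```python
from collections import defaultdict

def build_index(docs):
    """Map each keyword to a sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # ascending ids keep every list sorted
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return index

def intersect(a, b):
    """AND: ids present in both sorted lists, found in one merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: ids present in either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # equal ids: emit once, advance both
            out.append(a[i])
            i += 1
            j += 1
    return out

def difference(a, b):
    """a AND NOT b: ids in sorted list a that are absent from sorted list b."""
    j = 0
    out = []
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

# Hypothetical document collection keyed by document identifier.
docs = {
    1: "database systems",
    2: "information retrieval systems",
    3: "retrieval of documents",
}
index = build_index(docs)
print(intersect(index["systems"], index["retrieval"]))
print(union(index["systems"], index["retrieval"]))
print(difference(index["systems"], index["retrieval"]))
```

Each operation touches every list entry at most once, which is the benefit of keeping the Si sorted.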
24
Concept-Based Querying
Approach
For each word, determine the concept it represents from context (e.g., disambiguate a word by looking at the surrounding words in the same page)
Use one or more ontologies:
• Hierarchical structure showing relationships between concepts
• e.g., the ISA relationship that we saw in the E-R model
This approach can be used to standardize terminology in a specific field, but it comes with very high overhead in handling billions of documents, so it is not well suited to search engines
Ontologies can link multiple languages
Foundation of the Semantic Web (not covered here)