Intro to Information Retrieval
By the end of the lecture you should be able to:
- explain the differences between database and information retrieval technologies
- describe the basic maths underlying the set-theoretic and vector models of classical IR
Reminder: efficiency is vital
Google finds documents which match your keywords; this must be done EFFICIENTLY – we can't just scan each document from start to end for every keyword.
So, the cache stores a copy of each document, and also a "cut-down" version for searching: just a "bag of words", a sorted list (or array/vector/…) of the words appearing in the document, with links back to the full document.
Keywords are matched against this list; if found, the full document is returned.
Even cleverer: a dictionary and an inverted file…
Inverted file structure
[Diagram: three linked structures. The dictionary lists each term with its document frequency – Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), … – and points into the inverted (postings) file, which holds the sorted document IDs in which each term occurs (Term 1 → 1, 2; Term 2 → 1, 2, 3; Term 3 → 2; Term 4 → 2, 3, 4; …). The postings in turn point to the documents Doc 1 … Doc 6 in the data file.]
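The dictionary-plus-postings organisation can be sketched in a few lines. This is a minimal illustration only: the document texts and variable names below are made up, not taken from the slides.

```python
# Minimal sketch of a dictionary + inverted (postings) file.
# Documents and texts are illustrative examples.
from collections import defaultdict

docs = {
    1: "jam pudding recipe",
    2: "report on traffic lanes",
    3: "traffic jam in pudding lane",
}

# Dictionary: term -> postings list (sorted document IDs).
# The document frequency is simply len(postings).
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in sorted(set(text.split())):
        index[term].append(doc_id)

# Look up a keyword without scanning every document end-to-end.
print(index["jam"])       # postings list for "jam" -> [1, 3]
print(len(index["jam"]))  # document frequency -> 2
```

A keyword lookup is now a single dictionary access followed by a walk of a (usually short) postings list, rather than a scan of every document.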
IR vs DBMS
|                     | DBMS          | IR                    |
|---------------------|---------------|-----------------------|
| match               | exact         | partial or best match |
| inference           | deduction     | induction             |
| model               | deterministic | probabilistic         |
| data                | record/field  | text document         |
| query language      | artificial    | natural?              |
| query specification | complete      | incomplete            |
| items wanted        | matching      | relevant              |
| error response      | sensitive     | insensitive           |
informal introduction
- IR was developed for bibliographic systems. We shall refer to 'documents', but the technique extends beyond items of text.
- Central to IR is the representation of a document by a set of 'descriptors' or 'index terms' ("words in the document").
- Searching for a document is carried out (mainly) in the 'space' of index terms.
- We need a language for formulating queries, and a method for matching queries with document descriptors.
architecture
[Diagram: the user sends a query to the query-matching component, which searches the object base (objects and their descriptions) and returns hits; the user's feedback on the hits drives a learning component.]
basic notation
Given a list of m documents, D, and a list of n index terms, T, we define wi,j to be a weight associated with the ith keyword and the jth document.
For the jth document, we define an index term vector, dj:
dj = (w1,j, w2,j, …, wn,j)
For example, with D = {d1, d2, d3} and T = {pudding, jam, traffic, lane, treacle}:
d1 = (1, 1, 0, 0, 0) – recipe for jam pudding
d2 = (0, 0, 1, 1, 0) – DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0) – radio item on traffic jam in Pudding Lane
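The slide's binary vectors can be built mechanically from each document's word set. A small sketch, using the slide's T and documents; the helper name `term_vector` is ours.

```python
# Build binary index-term vectors d_j over the slide's term list T.
T = ["pudding", "jam", "traffic", "lane", "treacle"]

def term_vector(words):
    """Return (w1,j, ..., wn,j) where wi,j = 1 if term i occurs in the document."""
    return tuple(1 if t in words else 0 for t in T)

d1 = term_vector({"recipe", "jam", "pudding"})           # recipe for jam pudding
d2 = term_vector({"report", "traffic", "lane"})          # DoT report on traffic lanes
d3 = term_vector({"traffic", "jam", "pudding", "lane"})  # radio item on traffic jam
print(d1, d2, d3)  # (1, 1, 0, 0, 0) (0, 0, 1, 1, 0) (1, 1, 1, 1, 0)
```

Note that words outside T (e.g. "recipe", "report") simply do not contribute to the vector.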
set theoretic, Boolean model
Queries are Boolean expressions formed using keywords, e.g.:
('jam' ∨ 'treacle') ∧ 'pudding' ∧ ¬'lane' ∧ ¬'traffic'
The query is re-expressed in disjunctive normal form (DNF), e.g.:
(1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
To match a document with a query:
sim(d, qDNF) = 1 if d is equal to a component of qDNF
             = 0 otherwise
CF: T = {pudding, jam, traffic, lane, treacle}
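The exact-match rule above is just a membership test: a document matches if its binary vector equals one of the DNF components. A minimal sketch, using the slide's vectors:

```python
# sim(d, qDNF): 1 if the document vector equals a DNF component, else 0.
# Vectors taken from the slide's example.
q_dnf = [(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)]

def sim(d, q_dnf):
    return 1 if d in q_dnf else 0

d1 = (1, 1, 0, 0, 0)  # recipe for jam pudding
d2 = (0, 0, 1, 1, 0)  # DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)  # radio item on traffic jam in Pudding Lane
print(sim(d1, q_dnf), sim(d2, q_dnf), sim(d3, q_dnf))  # 1 0 0
```

Only d1 matches a component, which agrees with the "collecting results" slide below.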
d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0)
(1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
T = {pudding, jam, traffic, lane, treacle}
[Venn diagram over the term sets pudding, jam, traffic, lane and treacle, locating the documents and the DNF components.]
collecting results
T = {pudding, jam, traffic, lane, treacle}
Query: ('jam' ∨ 'treacle') ∧ 'pudding' ∧ ¬'lane' ∧ ¬'traffic'
Answer: d1 = (1, 1, 0, 0, 0) – Jam pud recipe
[Venn diagram over the term sets, showing that only d1 satisfies the query.]
Statistical vector model
Weights 0 <= wi,j <= 1 are no longer binary-valued; the query is also represented by a vector:
q = (w1q, w2q, …, wnq), e.g. q = (1.0, 0.6, 0.0, 0.0, 0.8)
CF: T = {pudding, jam, traffic, lane, treacle}
To match the jth document with a query:
sim(dj, q) = Σi=1..n (wij × wiq) / √( (Σi=1..n wij²) × (Σi=1..n wiq²) )
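The matching coefficient is straightforward to compute: inner product over the product of the two vector norms. A minimal sketch (the function name is ours), checked against the slide's example query and first document:

```python
# Cosine matching coefficient:
# sim(d, q) = sum(wij*wiq) / sqrt(sum(wij^2) * sum(wiq^2))
from math import sqrt

def cosine_sim(d, q):
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = sqrt(sum(w * w for w in d) * sum(w * w for w in q))
    return num / den if den else 0.0

q  = (1.0, 0.6, 0.0, 0.0, 0.8)  # the slide's example query
d1 = (0.8, 0.8, 0.0, 0.0, 0.2)  # jam pud recipe
print(round(cosine_sim(d1, q), 2))  # 0.89
```

The zero-denominator guard handles an all-zero vector, for which the coefficient is conventionally taken as 0.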
Cosine coefficient
[Diagram: document D1 and query Q drawn as vectors in the plane of terms T1 and T2, with components (w11, w21) and (w1q, w2q), and angle θ between them.]
sim(dj, q) = Σi=1..n (wij × wiq) / √( (Σi=1..n wij²) × (Σi=1..n wiq²) ) = cos(θ)
Cosine coefficient
[Diagram: D1 and Q point in the same direction, so θ = 0.]
sim(dj, q) = cos(θ) = cos(0) = 1
Cosine coefficient
[Diagram: D1 and Q are orthogonal (w1q = 0, w21 = 0), so θ = 90°.]
sim(dj, q) = cos(θ) = cos(90°) = 0
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2) – Jam pud recipe
Σi=1..n (wij × wiq) = 0.8×1.0 + 0.8×0.6 + 0.0×0.0 + 0.0×0.0 + 0.2×0.8 = 1.44
Σi=1..n wij² = 0.8² + 0.8² + 0.0² + 0.0² + 0.2² = 1.32
Σi=1..n wiq² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
sim(d1, q) = 1.44 / √(1.32 × 2.0) = 0.89
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0) – DoT report
Σi=1..n (wij × wiq) = 0.0×1.0 + 0.0×0.6 + 0.9×0.0 + 0.8×0.0 + 0.0×0.8 = 0.0
Σi=1..n wij² = 0.0² + 0.0² + 0.9² + 0.8² + 0.0² = 1.45
Σi=1..n wiq² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
sim(d2, q) = 0.0 / √(1.45 × 2.0) = 0.0
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0) – Radio traffic report
Σi=1..n (wij × wiq) = 0.6×1.0 + 0.9×0.6 + 1.0×0.0 + 0.6×0.0 + 0.0×0.8 = 1.14
Σi=1..n wij² = 0.6² + 0.9² + 1.0² + 0.6² + 0.0² = 2.53
Σi=1..n wiq² = 1.0² + 0.6² + 0.0² + 0.0² + 0.8² = 2.0
sim(d3, q) = 1.14 / √(2.53 × 2.0) = 0.51
collecting results
CF: T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)

| Rank | document vector                      | document (sim)              |
|------|--------------------------------------|-----------------------------|
| 1.   | d1 = (0.8, 0.8, 0.0, 0.0, 0.2)       | Jam pud recipe (0.89)       |
| 2.   | d3 = (0.6, 0.9, 1.0, 0.6, 0.0)       | Radio traffic report (0.51) |
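The ranked output can be reproduced by scoring each document vector against q with the cosine coefficient and sorting by similarity. A minimal sketch using the slide's vectors (the function and variable names are ours):

```python
# Score each document against q and rank by cosine similarity.
from math import sqrt

def cosine_sim(d, q):
    num = sum(wd * wq for wd, wq in zip(d, q))
    den = sqrt(sum(w * w for w in d) * sum(w * w for w in q))
    return num / den if den else 0.0

q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "Jam pud recipe":       (0.8, 0.8, 0.0, 0.0, 0.2),
    "DoT report":           (0.0, 0.0, 0.9, 0.8, 0.0),
    "Radio traffic report": (0.6, 0.9, 1.0, 0.6, 0.0),
}
ranked = sorted(docs.items(), key=lambda kv: cosine_sim(kv[1], q), reverse=True)
for name, d in ranked:
    print(f"{name}: {cosine_sim(d, q):.2f}")
```

This prints the jam pudding recipe first (0.89), then the radio traffic report (0.51), with the DoT report scoring 0.00, matching the worked examples.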
Discussion: Set theoretic model
- The Boolean model is simple and queries have precise semantics, but it is an 'exact match' model and does not rank results.
- The Boolean model is popular with bibliographic systems and available on some search engines.
- Users find Boolean queries hard to formulate.
- Attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.
Discussion: Vector Model
- The vector model is simple and fast, and experiments show it leads to 'good' results.
- Partial matching leads to ranked output.
- It is a popular model with search engines.
- It rests on an assumption of term independence (not realistic! phrases, collocations, grammar).
- The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).
questions raised
- Where do the index terms come from? (ALL the words in the source documents?)
- What determines the weights?
- How well can we expect these systems to work for practical applications?
- How can we improve them?
- How do we integrate IR into more traditional DB management?
Questions to think about
- Why is a traditional database unsuited to retrieval of unstructured information?
- How would you re-express a Boolean query, e.g. (A or B or (C and not D)), in disjunctive normal form?
- For the matching coefficient sim(·, ·), show that 0 <= sim(·, ·) <= 1, and that sim(a, a) = 1.
- Compare and contrast the 'vector' and 'set theoretic' models in terms of power of representation of documents and queries.