TRANSCRIPT
Introduction to Information Retrieval
CS3245 – Information Retrieval
Search Advertising, Duplicate Detection and Revision

Last Time
Chapter 20: Crawling
Chapter 21: Link Analysis
Ch. 11-12
Today
Exam Matters
Revision
Where to go from here
Chapter 19: Search Advertising, Duplicate Detection
Web search overview
Ch. 19
IR on the web vs. IR in general
On the web, search is not just a nice feature. Search is a key enabler of the web: content creation, financing, interest aggregation, etc. → look at search ads (Search Advertising).
The web is a chaotic and uncoordinated document collection → lots of duplicates; we need to detect duplicates (Duplicate Detection).
Without search engines, the web wouldn't work
Without search, content is hard to find.
→ Without search, there is no incentive to create content. Why publish something if nobody will read it? Why publish something if I don't get ad revenue from it?
Somebody needs to pay for the web: servers, web infrastructure, content creation. A large part today is paid by search ads. Search pays for the web.
Interest aggregation
Unique feature of the web: a small number of geographically dispersed people with similar interests can find each other.
Elementary school kids with hemophilia
People interested in translating R5RS Scheme into relatively portable C (open source project)
Search engines are a key enabler for interest aggregation.
SEARCH ADVERTISING
Sec. 19.3
1st Generation of Search Ads: Goto (1996)
1st Generation of search ads: Goto (1996)
Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link.
Pages were simply ranked according to bid – revenue maximization for Goto.
No separation of ads/docs. Only one result list!
Upfront and honest: no relevance ranking, but Goto did not pretend there was any.
2nd generation of search ads: Google (2000)
SogoTrade appears in search results.
SogoTrade appears in ads.
Do search engines rank advertisers higher than non-advertisers?
All major search engines claim "no".
Do ads influence editorial content?
Similar problem at newspapers and TV channels: a newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers and on TV.
No known case of this happening with search engines yet.
Leads to the jobs of white and black hat search engine optimization (organic) and search engine marketing (paid).
How are ads ranked?
Advertisers bid for keywords – sale by auction.
Open system: anybody can participate and bid on keywords.
Advertisers are only charged when somebody clicks on their ad (i.e., cost per click, CPC).
How does the auction determine an ad's rank and the price paid for the ad?
The basis is a second price auction, but with twists.
For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.
How are ads ranked?
First cut: according to bid price – à la Goto. Bad idea: open to abuse. Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
Instead: rank based on bid price and relevance.
Key measure of ad relevance: click-through rate (CTR) = clicks per impression.
Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term.
Hope: overall acceptance of the system and overall revenue are maximized if users get useful information.
Google's second price auction
bid: maximum bid for a click by the advertiser
CTR: click-through rate; when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
ad rank: bid × CTR; this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank: rank in the auction
paid: second price auction price paid by the advertiser
Google's second price auction
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2), so price1 = bid2 × CTR2 / CTR1.
p1 = bid2 × CTR2 / CTR1 = 3.00 × 0.03 / 0.06 = 1.50
p2 = bid3 × CTR3 / CTR2 = 1.00 × 0.08 / 0.03 = 2.67
p3 = bid4 × CTR4 / CTR3 = 4.00 × 0.01 / 0.08 = 0.50
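The worked example above can be reproduced with a short sketch (illustrative only, not Google's actual system; ads are assumed to arrive already sorted by ad rank, and the extra 1 cent is omitted):

```python
def second_price_payments(ranked_ads):
    """Per-click prices under the second price auction.

    ranked_ads: (bid, CTR) pairs already sorted by ad rank = bid x CTR,
    highest first. The ad at rank i pays bid_{i+1} * CTR_{i+1} / CTR_i;
    the last ad has nobody below it and is skipped here.
    """
    prices = []
    for i in range(len(ranked_ads) - 1):
        _bid_i, ctr_i = ranked_ads[i]
        bid_next, ctr_next = ranked_ads[i + 1]
        prices.append(bid_next * ctr_next / ctr_i)
    return prices

# (bid, CTR) pairs read off the slide's worked example; the top ad's own
# bid is never needed to compute any price, so it is left as None.
ranked = [(None, 0.06), (3.00, 0.03), (1.00, 0.08), (4.00, 0.01)]
print([round(p, 2) for p in second_price_payments(ranked)])  # [1.5, 2.67, 0.5]
```

Note how p2 can exceed p1: the price depends on the bid and CTR of the ad one rank below, not on the ad's own position.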
Keywords with high bids
Top 10 Most Expensive Google Keywords: Insurance, Loans, Mortgage, Attorney, Credit, Lawyer, Donate, Degree, Hosting, Claim
http://www.wordstream.com/articles/most-expensive-keywords
Search ads: a win-win-win
The search engine company gets revenue every time somebody clicks on an ad.
The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. As a result, users are often satisfied with what they find after clicking on an ad.
The advertiser finds new customers in a cost-effective way.
Not a win-win-win: Keyword arbitrage
Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google, e.g., redirect to a page full of ads. This rarely makes sense for the user.
(Ad) spammers keep inventing new tricks, and the search engines need time to catch up with them: Adversarial Information Retrieval.
Not a win-win-win: Violation of trademarks
Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.
https://support.google.com/adwordspolicy/answer/6118?rd=1
"It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
DUPLICATE DETECTION
Sec. 19.6
Duplicate detection
The web is full of duplicated content, more so than many other collections.
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5).
Near-duplicates: more common on the web, and difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents. Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
Near-duplicates: Example
Detecting near-duplicates
Compute similarity with an edit-distance-style measure.
We want "syntactic" (as opposed to semantic) similarity: true semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is / isn't a near-duplicate", e.g., two documents are near-duplicates if similarity > θ = 80%.
Recall: Jaccard coefficient
A commonly used measure of overlap of two sets. Let A and B be two sets. The Jaccard coefficient is J(A, B) = |A ∩ B| / |A ∪ B|.
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅.
A and B don't have to be the same size. The coefficient always assigns a number between 0 and 1.
Jaccard coefficient: Example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity!
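The example can be verified with a few lines of Python (lowercasing and whitespace tokenization are assumed):

```python
def shingles(text, n=2):
    """Word n-gram shingle set of a document (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|; 0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2))  # 0.375  (3 shared bigrams out of 8 in the union)
print(jaccard(d1, d3))  # 0.0    (no shared bigram, despite shared words)
```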
A document as a set of shingles
A shingle is simply a word n-gram. Shingles are used as features to measure syntactic similarity of documents.
For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting. We use s_k to refer to the shingle's fingerprint in [1..2^m].
This doesn't directly help us – we are just converting strings to large integers. But it will help us compute an approximation to the actual Jaccard coefficient quickly.
Documents as sketches
The number of shingles per document is large, so it is difficult to compare documents exhaustively. To make comparison fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1, ..., π200. Each πi is a random permutation on [1..2^m].
The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), ..., min_{s∈d} π200(s) > (a vector of 200 numbers).
Deriving a sketch element: a permutation of the original hashes
[Figure: the hash sets of document 1 and document 2 are both reordered by the same permutation π; the minimum under π becomes each document's sketch element.]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says: d1 ≈ d2.
Proof that J(S(di), S(dj)) ≅ P(xi = xj)
We view a matrix A with one column per set of hashes: element Aij = 1 if element i is present in set Sj. A permutation π is a random reordering of the rows of A; xk is the first non-zero entry in π applied to column k, i.e., the first shingle present in document k.
Let C00 = number of rows in A where both entries are 0; define C11, C10, C01 likewise.
J(Si, Sj) is then equivalent to C11 / (C10 + C01 + C11).
P(xi = xj) is then also equivalent to C11 / (C10 + C01 + C11): rows of type C00 never supply a first non-zero entry, and under a random permutation the first remaining row is equally likely to be any of the C10 + C01 + C11 non-zero rows, so it is a C11 row (giving xi = xj) with exactly that probability.
[Figure: an example matrix A over columns Si, Sj with rows (0,1), (1,0), (1,1), (0,0), (1,1), (0,1), so J(Si, Sj) = 2/5; sample permutations π(1), π(2), π(3), ..., π(n=200) of its rows yield outcomes xi ≠ xj, xi ≠ xj, xi = xj, xi = xj, ...]
Estimating Jaccard
Thus the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. An estimator of the probability of success is the proportion of successes in n Bernoulli trials (n = 200). Our sketch is based on a random selection of permutations.
Thus, to estimate the Jaccard coefficient, count the number k of successful permutations for < d1, d2 > and divide by n = 200: k/n = k/200 estimates J(d1, d2).
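A runnable sketch of the estimate; note that the slide's true random permutations on [1..2^m] are approximated here by random affine hash functions, a standard simplification:

```python
import random

def minhash_sketch(shingle_fps, hash_funcs):
    """Sketch of a document = minimum of each hash function over the
    document's shingle fingerprints."""
    return [min(h(s) for s in shingle_fps) for h in hash_funcs]

def estimate_jaccard(sk1, sk2):
    """Proportion of 'successful' permutations: sketch positions where
    the two documents agree."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

random.seed(42)
M = (1 << 61) - 1  # a large prime modulus
# n = 200 random affine hashes x -> (a*x + b) mod M stand in for the
# random permutations (an approximation, not exact min-wise independence).
hash_funcs = [
    (lambda x, a=random.randrange(1, M), b=random.randrange(M): (a * x + b) % M)
    for _ in range(200)
]

d1 = {101, 202, 303, 404}       # toy shingle fingerprints
d2 = {101, 202, 303, 505, 606}  # true J(d1, d2) = 3/6 = 0.5
est = estimate_jaccard(minhash_sketch(d1, hash_funcs), minhash_sketch(d2, hash_funcs))
print(round(est, 2))  # k/200 for this run; close to 0.5
```

With only 200 trials the estimate fluctuates around the true value; the standard deviation here is about sqrt(J(1-J)/n) ≈ 0.035.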
Shingling: Summary
Input: N documents.
Choose the n-gram size for shingling, e.g., n = 5.
Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities, then take the transitive closure of documents with similarity > θ.
Index only one document from each equivalence class.
Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Subscribe via the NUS EduRec System by 20 May 2019.
Get exam results earlier via SMS on 3 Jun 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through slides) and the corresponding sections in the textbook.
For sample questions, refer to tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.
Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!

REVISION
Week 1: Ngram Models
A language model is created based on a collection of text, and used to assign a probability to a sequence of words.
Unigram LM: bag of words.
Ngram LM: use n-1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.
Unigram LM with Add-1 smoothing
Not used in practice, but the most basic smoothing to understand. Idea: add 1 to the count of every entry in the LM, including those that are not seen.
E.g., assuming |V| = 11 (the total number of word types in both lines):

Aerosmith line (smoothed counts): I: 2, don't: 2, want: 2, to: 2, close: 2, my: 2, eyes: 2, your: 1, love: 1, and: 1, revenge: 1. Total count: 18.
LadyGaga line (smoothed counts): I: 3, want: 3, your: 3, love: 2, and: 2, revenge: 2, eyes: 1, don't: 1, to: 1, close: 1, my: 1. Total count: 20.

Q2 (by probability): "I don't want"
P(Aerosmith) = 2/18 × 2/18 × 2/18 ≈ 1.3E-3
P(LadyGaga) = 3/20 × 1/20 × 3/20 ≈ 1.1E-3
Winner: Aerosmith
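The computation can be replayed in Python; the two lyric lines below are reconstructed from the slide's count tables, so treat them as an assumption:

```python
from collections import Counter

def add_one_lm(tokens, vocab):
    """Unigram LM with add-1 smoothing: P(w) = (c(w) + 1) / (N + |V|)."""
    counts = Counter(tokens)
    denom = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

def seq_prob(lm, words):
    """Unigram probability of a word sequence = product of P(w)."""
    p = 1.0
    for w in words:
        p *= lm[w]
    return p

# Hypothetical reconstruction of the two lines behind the count tables.
aero = "i don't want to close my eyes".split()
gaga = "i want your love and i want your revenge".split()
vocab = set(aero) | set(gaga)  # |V| = 11 word types across both lines

query = "i don't want".split()
p_aero = seq_prob(add_one_lm(aero, vocab), query)  # (2/18)^3 ~ 1.3e-3
p_gaga = seq_prob(add_one_lm(gaga, vocab), query)  # (3/20)(1/20)(3/20) ~ 1.1e-3
print(p_aero > p_gaga)  # True: the Aerosmith model wins
```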
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging", with simple optimizations by expected size.
Ch. 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into dictionary and postings.
Document frequency information is also stored. Why frequency?
Sec. 1.2
The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Example: Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128; Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34. Intersection: 2 → 8.
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
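The linear-time merge can be sketched directly:

```python
def intersect(p1, p2):
    """Linear-time merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```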
Sec. 1.3
Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch. 2
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How do we do it? And where do we place the skip pointers?
[Figure: two postings lists, 2 → 4 → 8 → 41 → 48 → 64 → 128 and 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, each augmented with skip pointers.]
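A sketch of intersection with skips; as a simplification it allows a √L-length jump from any position, whereas a real index stores skip pointers only at fixed positions chosen at indexing time:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, jumping ahead by sqrt(L)
    whenever the skip target does not overshoot the other list's
    current docID (so no match can be lost)."""
    s1 = max(1, int(math.sqrt(len(p1))))  # skip length for p1
    s2 = max(1, int(math.sqrt(len(p2))))  # skip length for p2
    ans, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            ans.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1           # take the skip pointer
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return ans

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```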
Sec. 2.3
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
Sec. 2.4.1
Positional index example
For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; ...>
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec. 2.4.2
Choosing terms
Definitely language specific! In English we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic: Soundex
Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything.
Not very tolerant!
Sec. 3.1
Wildcard queries
How can we handle *'s in the middle of a query term, e.g., co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index.
Alternative: k-gram index with postfiltering.
Sec. 3.2
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
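Edit distance, the first measure above, has the standard dynamic-programming computation (uniform costs here; weighted edit distance would vary the cost terms):

```python
def levenshtein(s, t):
    """Edit (Levenshtein) distance: d[i][j] = minimum number of
    insertions, deletions, and substitutions turning s[:i] into t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # (mis)match
    return d[m][n]

print(levenshtein("judgment", "judgement"))  # 1 (one inserted 'e')
```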
Sec. 3.3.2
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing (BSBI). Merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing (SPIMI): no global dictionary; generate a separate dictionary for each block. Don't sort postings; accumulate postings as they occur.
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs. These records are generated as we parse docs.
We must now sort 100M 8-byte records by termID. Define a block as ~10M such records; we can easily fit a couple into memory, and we will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec. 4.2
SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
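A toy sketch of inverting one SPIMI block (postings here are plain docID lists; real SPIMI also tracks term frequencies and writes each finished block to disk):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build one SPIMI block from a stream of (term, docID) pairs.
    Postings are accumulated as they occur -- no global term-termID
    mapping and no sorting of postings (docIDs arrive in order).
    The dictionary is sorted only once, when the block is written."""
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # append; skip repeats within a doc
    return dict(sorted(index.items()))  # sort terms for the final merge

stream = [("caesar", 1), ("brutus", 1), ("caesar", 2),
          ("brutus", 4), ("caesar", 2)]
print(spimi_invert(stream))  # {'brutus': [1, 4], 'caesar': [1, 2]}
```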
Sec. 4.3
Distributed Indexing: MapReduce Data flow
[Figure: input splits (segment files) feed the parsers in the map phase, coordinated by a master; each parser partitions its output by term range (a-b, c-d, ..., y-z); in the reduce phase, one inverter per partition reads the matching segment files and writes the postings.]
Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output for a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to O(log T) times.
Sec. 4.5
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; keep the larger ones (I0, I1, ...) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) to form Z1.
Then either write Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2, and so on.
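A sketch of this cascade, using sorted lists of docIDs to stand in for on-disk indexes (the size threshold and merge-by-concatenation are simplifications of real index writes):

```python
def log_merge_insert(levels, z0, n):
    """Logarithmic merge step. levels[i] is an on-disk index of size
    about n * 2**i, or None if that slot is free. When the in-memory
    index z0 exceeds n postings, cascade it down, merging with each
    occupied level until a free slot is found."""
    if len(z0) <= n:
        return levels, z0                  # Z0 still fits in memory
    carry = sorted(z0)
    for i in range(len(levels)):
        if levels[i] is None:
            levels[i] = carry              # write Z_i to disk as I_i
            return levels, []
        carry = sorted(levels[i] + carry)  # merge with I_i, forming Z_{i+1}
        levels[i] = None
    levels.append(carry)                   # open a new, larger level
    return levels, []

levels, mem = log_merge_insert([], [3, 1, 2], n=2)
levels, mem = log_merge_insert(levels, [6, 4, 5], n=2)
print(levels, mem)  # [None, [1, 2, 3, 4, 5, 6]] []
```

Each posting moves down one level per merge, so it participates in at most O(log T) merges.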
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster.
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string; the example below uses k = 4. We need to store the term lengths (1 extra byte each).
...7systile9syzygetic8syzygial6syzygy11szaibelyite...
We save 9 bytes on 3 pointers, but lose 4 bytes on term lengths.
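The length-prefixed packing can be sketched as follows; `pack_block` and `unpack_block` are illustrative helper names, and the decoder assumes dictionary terms contain no digits:

```python
def pack_block(terms):
    """Dictionary-as-a-string with blocking: concatenate k consecutive
    terms, each prefixed with its length, so the block needs only one
    term pointer instead of k."""
    return "".join(f"{len(t)}{t}" for t in terms)

def unpack_block(s):
    """Recover a block's terms by reading length-prefixed strings."""
    terms, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j].isdigit():
            j += 1                      # read the (possibly 2-digit) length
        n = int(s[i:j])
        terms.append(s[j:j + n])
        i = j + n
    return terms

block = pack_block(["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite"])
print(block)  # 7systile9syzygetic8syzygial6syzygy11szaibelyite
```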
Sec. 5.2
CS3245 ndash Information Retrieval
Last TimeChapter 20 Crawling
Chapter 21 Link Analysis
Ch 11-12
2
CS3245 ndash Information Retrieval
Today Exam Matters
Revision
Where to go from here
3
Chapter 19 Search Advertising Duplicate Detection
CS3245 ndash Information Retrieval
Web search overview
4
Ch 19
CS3245 ndash Information Retrieval
IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation
financing interest aggregation etc rarr look at search ads (Search Advertising)
The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)
5
Ch 19
CS3245 ndash Information Retrieval
Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create
content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from
it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays
for the web6
Ch 19
CS3245 ndash Information Retrieval
Interest aggregation Unique feature of the web A small number of
geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into
relatively portable C (open source project) Search engines are a key enabler for interest
aggregation
7
Ch 19
CS3245 ndash Information Retrieval
SEARCHADVERTISING
8
Sec 193
CS3245 ndash Information Retrieval
1st Generation of Search AdsGoto (1996)
9
Sec 193
CS3245 ndash Information Retrieval
1st Generation of search adsGoto (1996)
Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the
link Pages were simply ranked according to bid ndash revenue
maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking
but Goto did not pretend there was any10
Sec 193
CS3245 ndash Information Retrieval
2nd generation of search ads Google (2000)
11
SogoTrade appearsin searchresults
SogoTrade appearsin ads
Do search enginesrank advertisershigher thannon-advertisers
All major searchengines claim ldquonordquo
Sec 193
CS3245 ndash Information Retrieval
Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major
advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet
Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)
12
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your
ad (ie CPC)
How does the auction determine an adrsquos rank and the price paidfor the ad
Basis is a second price auction but with twists For the bottom line this is perhaps the most important
research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means
billions of additional revenue for the search engine
13
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads
Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =
clicks per impression Result A nonrelevant ad will be ranked low
Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is
maximized if users get useful information
14
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what
percentage of time do users click on it CTR is a measure of relevance
ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank rank in auction paid second price auction price paid by advertiser
15
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction
price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)
price1 = bid2 times CTR2 CTR1
p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050
16
Sec 193
CS3245 ndash Information Retrieval
Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim
17
Sec 193
httpwwwwordstreamcomarticlesmost-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary
Input: N documents. Choose the n-gram size for shingling, e.g. n = 5. Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities. Take the transitive closure of documents with similarity > θ. Index only one document from each equivalence class.
Summary
Search Advertising: A type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: Represent documents as shingles. Calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Subscribe via NUS EduRec System by 20 May 2019.
Get Exam Results earlier via SMS on 3 Jun 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through slides) and the corresponding sections in the textbook.
For sample questions, refer to tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.
Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!
REVISION
Week 1: Ngram Models
A language model is created based on a collection of text, and is used to assign a probability to a sequence of words.
Unigram LM: Bag of words. Ngram LM: use n-1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.
Unigram LM with Add-1 smoothing
Not used in practice, but the most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.
e.g., assuming |V| = 11 (the total number of word types in both lyric lines):

Line 1 (Aerosmith): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1. Total count: 18.
Line 2 (Lady Gaga): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2. Total count: 20.

Q2 (By Probability): "I don't want"
P(Aerosmith) = 3/29 × 3/29 × 3/29 ≈ 1.11E-3
P(LadyGaga) = 4/31 × 2/31 × 4/31 ≈ 1.07E-3
Winner: Aerosmith
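The worked example can be checked with a short script. The counts are copied from the two tables above; `prob` is a made-up helper name, not the lecture's code.

```python
# Word counts from the two lyric lines (|V| = 11 word types overall)
aerosmith = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}   # total 18
ladygaga = {"i": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}    # total 20
V = 11

def prob(query, counts):
    """Add-1 smoothed unigram probability of a query under one LM."""
    total = sum(counts.values())
    p = 1.0
    for w in query.split():
        p *= (counts.get(w, 0) + 1) / (total + V)
    return p

q = "i don't want"
pa, pg = prob(q, aerosmith), prob(q, ladygaga)
print(f"Aerosmith {pa:.2e}  LadyGaga {pg:.2e}")  # Aerosmith wins
```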
Week 2: Basic (Boolean) IR
Basic inverted indexes: In-memory dictionary and on-disk postings. Key characteristic: Sorted order for postings.
Boolean query processing: Intersection by linear-time "merging". Simple optimizations by expected size.
Ch 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?
Sec 1.2
The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
[Figure: Brutus → 2 4 8 16 32 64 128; Caesar → 1 2 3 5 8 13 21 34; intersection → 2 8.]
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
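The linear-time merge can be sketched as follows (a minimal illustration using the docIDs from the figure; the function name is made up):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(x+y) time."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer with the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```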
Sec 1.3
Week 3: Postings lists and Choosing terms
Faster merging of posting lists: Skip pointers.
Handling of phrase and proximity queries: Biword indexes for phrase queries; Positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: Text extraction, Granularity of indexing, Tokenization, Stop word removal, Normalization, Lemmatization and stemming.
Ch 2
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
[Figure: postings list 2 4 8 41 48 64 128 with skip pointers 2→41 and 41→128; postings list 1 2 3 8 11 17 21 31 with skip pointers 1→11 and 11→31.]
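One possible sketch of intersection with skips. The representation (a position-to-position map) and the sqrt-spacing heuristic are illustrative choices, not the lecture's; all names are made up.

```python
import math

def add_skips(postings):
    """Attach ~sqrt(len) evenly spaced skip pointers: position -> target position."""
    n = len(postings)
    step = int(math.sqrt(n)) or 1
    return {i: i + step for i in range(0, n - step, step)}

def intersect_with_skips(p1, skips1, p2, skips2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip only if it does not overshoot the other list's docID
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                j = skips2[j]
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 41, 48, 64, 128]
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
result = intersect_with_skips(p1, add_skips(p1), p2, add_skips(p2))
print(result)  # [2, 8]
```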
Sec 2.3
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords: friends romans, romans countrymen.
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
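A toy sketch of building and querying a biword index (the documents and helper names are invented for illustration):

```python
from collections import defaultdict

def index_biwords(docs):
    """Map each biword (consecutive term pair) to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        terms = text.lower().split()
        for a, b in zip(terms, terms[1:]):
            index[f"{a} {b}"].add(doc_id)
    return index

docs = {1: "friends romans countrymen", 2: "romans and countrymen"}
idx = index_biwords(docs)
print(sorted(idx["romans countrymen"]))  # [1]
```

A two-word phrase query is then a single dictionary lookup; longer phrases require intersecting several biword postings lists (with possible false positives).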
Sec 2.4.1
Positional index example
For phrase queries, we use a merge algorithm recursively, at the document level. But now we need to deal with more than just equality.
< be: 993427; doc 1: 7, 18, 33, 72, 86, 231; doc 2: 3, 149; doc 4: 17, 191, 291, 430, 434; doc 5: 363, 367; … >
Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
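A sketch of the positional check for the two-term phrase "to be". The 'be' postings are from the slide; the 'to' postings below are hypothetical, invented for illustration only.

```python
def phrase_docs(pos_to, pos_be):
    """Docs where 'to' is immediately followed by 'be' (positions differ by 1)."""
    hits = []
    for doc in pos_to.keys() & pos_be.keys():
        be_positions = set(pos_be[doc])
        if any(p + 1 in be_positions for p in pos_to[doc]):
            hits.append(doc)
    return sorted(hits)

# 'be' postings from the slide; 'to' postings are made up for this example
be = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149], 4: [17, 191, 291, 430, 434], 5: [363, 367]}
to = {1: [4, 30, 100], 2: [1, 74, 222], 4: [8, 16, 190, 429, 433], 5: [14, 101]}
print(phrase_docs(to, be))  # [4]: only doc 4 has 'to' directly before 'be'
```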
Sec 2.4.2
Choosing terms
Definitely language specific!
In English, we worry about: Tokenization, Stop word removal, Normalization (e.g. case folding), Lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
Data Structures for the Dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic: Soundex
Hash Table
Each vocabulary term is hashed to an integer.
Pros: Lookup is faster than for a tree: O(1).
Cons: No easy way to find minor variants (judgment/judgement). No prefix search. If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything.
Not very tolerant!
Sec 3.1
Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end. This gives rise to the Permuterm index.
Alternative: K-gram index with postfiltering.
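A toy permuterm sketch. A real implementation would store the rotations in a B-tree for fast prefix lookup; here a linear scan stands in, and the vocabulary and helper names are invented.

```python
def rotations(term):
    """All rotations of term + '$': the permuterm vocabulary entries for one term."""
    t = term + "$"
    return {t[i:] + t[:i] for i in range(len(t))}

def build_permuterm(vocab):
    """Map every rotation back to its source term."""
    return {rot: term for term in vocab for rot in rotations(term)}

def wildcard_lookup(index, query):
    """Single-'*' query: rotate X*Y into Y$X* so the wildcard sits at the end,
    then prefix-match against the stored rotations."""
    prefix, suffix = query.split("*")
    key = suffix + "$" + prefix
    return sorted({term for rot, term in index.items() if rot.startswith(key)})

idx = build_permuterm(["constitution", "cotion", "caution", "concoction"])
print(wildcard_lookup(idx, "co*tion"))  # ['concoction', 'constitution', 'cotion']
```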
Sec 3.2
Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
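The first alternative, Levenshtein distance, has a standard dynamic-programming implementation (shown here with a rolling row to save space; the function name is made up):

```python
def edit_distance(a, b):
    """Levenshtein distance: insert, delete, substitute, each at cost 1."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free on match)
        prev = curr
    return prev[-1]

print(edit_distance("judgement", "judgment"))  # 1 (delete one 'e')
```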
Sec 3.3.2
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks!).
Single-Pass In-Memory Indexing: No global dictionary - generate a separate dictionary for each block. Don't sort postings - accumulate postings as they occur.
Distributed indexing using MapReduce.
Dynamic indexing: Multiple indices, logarithmic merge.
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs.
Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: Accumulate postings for each block, sort, and write to disk. Then merge the blocks into one long sorted order.
Sec 4.2
SPIMI: Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each block - no need to maintain a term-termID mapping across blocks.
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
With these two ideas, we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
Sec 4.3
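A toy SPIMI sketch. The token stream, block size, and helper names are invented; it assumes docIDs arrive in nondecreasing order, so per-term postings stay sorted without an explicit sort.

```python
from itertools import islice

def spimi_invert(token_stream, block_size):
    """Yield one in-memory inverted index (term -> postings list) per block."""
    it = iter(token_stream)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        index = {}
        for term, doc_id in block:
            postings = index.setdefault(term, [])   # new dictionary entry on demand
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)             # accumulate; no sorting of postings
        yield {t: index[t] for t in sorted(index)}  # sort terms only when writing the block

def merge_blocks(blocks):
    """Concatenate per-term postings from the block indices into one index."""
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            merged.setdefault(term, []).extend(postings)
    return merged

stream = [("new", 1), ("home", 1), ("sales", 1), ("home", 2), ("top", 2), ("sales", 2)]
final = merge_blocks(spimi_invert(stream, block_size=3))
print(final)  # {'home': [1, 2], 'new': [1], 'sales': [1, 2], 'top': [2]}
```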
Distributed Indexing: MapReduce Data flow
[Figure: in the Map phase, segment files of the collection are consumed by Parsers coordinated by a Master; parser output is partitioned by term range (a-b, c-d, …, y-z); in the Reduce phase, each Inverter collects one partition and writes out its postings.]
Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: Invalidation bit-vector for deleted docs. Filter the docs output on a search result by this invalidation bit-vector.
Periodically, re-index into one main index. Assuming T is the total number of postings and n the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2. Loop for log levels.
Sec 4.5
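The cascade above can be sketched as a toy simulation. List concatenation stands in for the real on-disk index merge, and all names are made up; `disk` maps level i to I_i.

```python
def log_merge_add(doc, z0, disk, n):
    """Add a doc to in-memory Z0; on overflow, cascade merges through I0, I1, ..."""
    z0.append(doc)
    if len(z0) <= n:
        return z0, disk
    z = z0                     # Z0 is full: flush it down the levels
    level = 0
    while level in disk:       # I_level exists: merge into Z_{level+1}, keep going
        z = disk.pop(level) + z
        level += 1
    disk[level] = z            # first free level: write Z as I_level
    return [], disk

disk, z0 = {}, []
for d in range(1, 8):          # add 7 docs with n = 1
    z0, disk = log_merge_add(d, z0, disk, n=1)
print(sorted(disk.items()), z0)  # I1 holds 4 docs, I0 holds 2, one doc still in Z0
```

Each doc participates in at most one merge per level, i.e. O(log T) merges, matching the slide's claim.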
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller & faster: Dictionary compression for Boolean indexes (dictionary string, blocks, front coding); Postings compression (gap encoding).
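Gap encoding, the postings-compression idea listed above, can be sketched in a few lines (the docIDs and helper names are invented for illustration):

```python
def to_gaps(postings):
    """Store differences between consecutive docIDs instead of the docIDs."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Recover the original docIDs by running a cumulative sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [283042, 283043, 283044, 283045]
gaps = to_gaps(postings)
print(gaps)  # [283042, 1, 1, 1] -- small gaps compress well with variable-length codes
```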
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4. Need to store term lengths (1 extra byte).
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec 5.2
CS3245 ndash Information Retrieval
Web search overview
4
Ch 19
CS3245 ndash Information Retrieval
IR on the web vs IR in general On the web search is not just a nice feature Search is a key enabler of the web content creation
financing interest aggregation etc rarr look at search ads (Search Advertising)
The web is a chaotic and uncoordinated document collection rarr lots of duplicates ndash need to detect duplicates (Duplicate Detection)
5
Ch 19
CS3245 ndash Information Retrieval
Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create
content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from
it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays
for the web6
Ch 19
CS3245 ndash Information Retrieval
Interest aggregation Unique feature of the web A small number of
geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into
relatively portable C (open source project) Search engines are a key enabler for interest
aggregation
7
Ch 19
CS3245 ndash Information Retrieval
SEARCHADVERTISING
8
Sec 193
CS3245 ndash Information Retrieval
1st Generation of Search AdsGoto (1996)
9
Sec 193
1st generation of search ads: Goto (1996)
- Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link.
- Pages were simply ranked according to bid – revenue maximization for Goto.
- No separation of ads/docs. Only one result list.
- Upfront and honest: no relevance ranking, but Goto did not pretend there was any.
2nd generation of search ads: Google (2000)
- SogoTrade appears in search results.
- SogoTrade appears in ads.
- Do search engines rank advertisers higher than non-advertisers? All major search engines claim "no".
Do ads influence editorial content?
- Similar problem at newspapers and TV channels: a newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers and on TV.
- No known case of this happening with search engines yet.
- Leads to the jobs of white- and black-hat search engine optimization (organic) and search engine marketing (paid).
How are ads ranked?
- Advertisers bid for keywords – sale by auction.
- Open system: anybody can participate and bid on keywords.
- Advertisers are only charged when somebody clicks on their ad (i.e., CPC, cost per click).
- How does the auction determine an ad's rank and the price paid for the ad?
- The basis is a second price auction, but with twists.
- For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.
How are ads ranked?
- First cut: according to bid price – à la Goto. Bad idea: open to abuse. Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
- Instead: rank based on bid price and relevance.
- Key measure of ad relevance: click-through rate (CTR) = clicks per impression.
- Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term.
- Hope: overall acceptance of the system and overall revenue are maximized if users get useful information.
Google's second price auction
- bid: maximum bid for a click by the advertiser
- CTR: click-through rate – when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
- ad rank: bid × CTR; this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
- rank: rank in the auction
- paid: second price auction price paid by the advertiser
Google's second price auction
- Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey auction.
- price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)
- price1 = bid2 × CTR2 / CTR1
  p1 = bid2 × CTR2 / CTR1 = $3.00 × 0.03 / 0.06 = $1.50
  p2 = bid3 × CTR3 / CTR2 = $1.00 × 0.08 / 0.03 = $2.67
  p3 = bid4 × CTR4 / CTR3 = $4.00 × 0.01 / 0.08 = $0.50
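The pricing rule above can be sketched in a few lines. This is a minimal illustration, not Google's actual implementation; the bid/CTR values are the hypothetical ones from the worked example (advertisers bidding $4.00, $3.00, $2.00, $1.00 with CTRs 0.01, 0.03, 0.06, 0.08), and since the deck gives no price for the last-ranked ad, it pays 0 here by assumption.

```python
# Sketch of a generalized second-price ad auction: rank ads by bid * CTR,
# and charge each ad the minimum needed to keep its rank:
#   price_i = bid_{i+1} * CTR_{i+1} / CTR_i
def auction(bids, ctrs):
    ads = sorted(zip(bids, ctrs), key=lambda bc: bc[0] * bc[1], reverse=True)
    prices = []
    for i, (bid, ctr) in enumerate(ads):
        if i + 1 < len(ads):
            nbid, nctr = ads[i + 1]
            prices.append(round(nbid * nctr / ctr, 2))
        else:
            prices.append(0.0)  # assumption: last-ranked ad pays nothing
    return prices

# Ranked by bid*CTR: 2.00*0.06=0.12, 3.00*0.03=0.09, 1.00*0.08=0.08, 4.00*0.01=0.04
print(auction([4.00, 3.00, 2.00, 1.00], [0.01, 0.03, 0.06, 0.08]))
# [1.5, 2.67, 0.5, 0.0]
```

The printed prices match p1, p2, p3 from the slide.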
Keywords with high bids
- Top 10 most expensive Google keywords: insurance, loans, mortgage, attorney, credit, lawyer, donate, degree, hosting, claim
- http://www.wordstream.com/articles/most-expensive-keywords
Search ads: a win-win-win
- The search engine company gets revenue every time somebody clicks on an ad.
- The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. As a result, users are often satisfied with what they find after clicking on an ad.
- The advertiser finds new customers in a cost-effective way.
Not a win-win-win: keyword arbitrage
- Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google, e.g., redirect to a page full of ads.
- This rarely makes sense for the user.
- (Ad) spammers keep inventing new tricks; the search engines need time to catch up with them: Adversarial Information Retrieval.
Not a win-win-win: violation of trademarks
- Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors.
- Geico lost this case in the United States. Louis Vuitton lost a similar case in Europe.
- https://support.google.com/adwordspolicy/answer/6118?rd=1 – "It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
DUPLICATE DETECTION
Sec 19.6
Duplicate detection
- The web is full of duplicated content, more so than many other collections.
- Exact duplicates: easy to detect, use a hash/fingerprint (e.g., MD5).
- Near-duplicates: more common on the web, difficult to eliminate.
- For the user, it's annoying to get a search result with near-identical documents.
- Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
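The exact-duplicate case can be sketched directly: fingerprint each document and compare digests. A minimal illustration (the three sample documents are made up; the slide only names MD5 as one possible fingerprint):

```python
# Sketch: exact-duplicate detection with one MD5 fingerprint per document.
# Two documents are treated as exact duplicates iff their digests match.
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

docs = ["the quick brown fox", "the quick brown fox", "the quick brown foxes"]
seen = {}  # digest -> first docID carrying it
for i, d in enumerate(docs):
    fp = fingerprint(d)
    if fp in seen:
        print(f"doc {i} is an exact duplicate of doc {seen[fp]}")
    else:
        seen[fp] = i
# prints: doc 1 is an exact duplicate of doc 0
```

Note this catches only byte-identical documents; near-duplicates need the shingling machinery below.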
Near-duplicates: example
Detecting near-duplicates
- Compute similarity with an edit-distance measure.
- We want "syntactic" (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute.
- We do not consider documents near-duplicates if they have the same content but express it with different words.
- Use a similarity threshold θ to make the call "is/isn't a near-duplicate", e.g., two documents are near-duplicates if similarity > θ = 80%.
Recall: Jaccard coefficient
- A commonly used measure of the overlap of two sets. Let A and B be two sets. Jaccard coefficient: J(A, B) = |A ∩ B| / |A ∪ B|
- JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.
Jaccard coefficient: example
Three documents:
  d1: "Jack London traveled to Oakland"
  d2: "Jack London traveled to the city of Oakland"
  d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
  J(d1, d2) = 3/8 = 0.375
  J(d1, d3) = 0
Note: very sensitive to dissimilarity.
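The worked example can be checked mechanically. A minimal sketch of bigram shingling plus the Jaccard coefficient, using the three documents above:

```python
# Sketch: word bigram (2-gram) shingling and the Jaccard coefficient.
def shingles(text: str, n: int = 2) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2))  # 0.375  (= 3/8)
print(jaccard(d1, d3))  # 0.0
```

d1 and d2 share the bigrams "jack london", "london traveled", "traveled to" out of 8 distinct bigrams in the union, giving 3/8; d1 and d3 share none.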
A document as a set of shingles
- A shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents.
- For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }
- We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
Fingerprinting
- We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting.
- We use s_k to refer to the shingle's fingerprint in 1..2^m.
- This doesn't directly help us – we are just converting strings to large integers.
- But it will help us compute an approximation to the actual Jaccard coefficient quickly.
Documents as sketches
- The number of shingles per document is large; it is difficult to compare exhaustively.
- To make it fast, we use a sketch: a subset of the shingles of a document.
- The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1 … π200.
- Each πi is a random permutation on 1..2^m.
- The sketch of d is defined as:
  < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) >
  (a vector of 200 numbers)
Deriving a sketch element: a permutation of the original hashes
[Figure: the fingerprints of document 1 and document 2 under a permutation π, with each document's minimum highlighted]
- We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?"
- In this case, permutation π says: d1 ≈ d2.
Proof that J(S(di), S(dj)) ≅ P(xi = xj)
- We view a matrix A with one column per set of fingerprints: element A_ij = 1 if element i is present in set S_j.
- Permutation π(n) is a random reordering of the rows of A.
- x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.
- Let C00 = number of rows in A where both entries are 0; define C11, C10, C01 likewise.
- J(si, sj) is then equivalent to C11 / (C10 + C01 + C11).
- P(xi = xj) is then equivalent to C11 / (C10 + C01 + C11).
[Figure: an example matrix A over columns Si and Sj with J(si, sj) = 2/5, and its rows under random permutations π(1), π(2), π(3), …, π(n=200); the example permutations yield outcomes xi ≠ xj, xi ≠ xj, xi = xj, …]
Estimating Jaccard
- Thus the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
- Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
- Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
- Our sketch is based on a random selection of permutations.
- Thus, to estimate Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200: k/n = k/200 estimates J(d1, d2).
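The whole sketch-and-estimate pipeline fits in a few lines. A minimal MinHash sketch under two stated assumptions: random permutations are simulated with random affine hash functions h(x) = (a·x + b) mod p (a common stand-in, not what the slide literally specifies), and CRC32 serves as the shingle fingerprint for illustration.

```python
# Sketch of MinHash: estimate J(d1, d2) from n per-"permutation" minima.
import random, zlib

random.seed(42)
P = (1 << 61) - 1   # a large prime; fingerprints are reduced mod P
N = 200             # sketch size, as on the slide
HASHES = [(random.randrange(1, P), random.randrange(P)) for _ in range(N)]

def sketch(shingle_set):
    ids = [zlib.crc32(s.encode()) for s in shingle_set]   # fingerprint
    return [min((a * x + b) % P for x in ids) for a, b in HASHES]

def estimate_jaccard(sk1, sk2):
    """k/n: fraction of 'permutations' whose minima agree."""
    return sum(m1 == m2 for m1, m2 in zip(sk1, sk2)) / len(sk1)

d1 = {"a b", "b c", "c d", "d e"}
d2 = {"a b", "b c", "c d", "x y"}
# True Jaccard is 3/5 = 0.6; with n = 200 trials the estimate lands nearby.
print(estimate_jaccard(sketch(d1), sketch(d2)))
```

Because each trial is Bernoulli with success probability J, the standard error at n = 200 is roughly √(J(1−J)/200) ≈ 0.035 here.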
Shingling: summary
- Input: N documents.
- Choose an n-gram size for shingling, e.g., n = 5.
- Pick 200 random permutations, represented as hash functions.
- Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
- Compute pairwise similarities.
- Take the transitive closure of documents with similarity > θ.
- Index only one document from each equivalence class.
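The last two steps (transitive closure, then one document per class) can be sketched with union-find. The similarity matrix below is a made-up illustration, not data from the slides:

```python
# Sketch: group documents whose pairwise similarity exceeds theta via
# transitive closure (union-find), then keep one representative per class.
def near_duplicate_classes(sim, theta):
    n = len(sim)
    parent = list(range(n))

    def find(x):                      # find root with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):                # union every above-threshold pair
        for j in range(i + 1, n):
            if sim[i][j] > theta:
                parent[find(i)] = find(j)

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())

sim = [[1.0, 0.9, 0.1, 0.85],
       [0.9, 1.0, 0.2, 0.3],
       [0.1, 0.2, 1.0, 0.0],
       [0.85, 0.3, 0.0, 1.0]]
# Pairs (0,1) and (0,3) exceed theta = 0.8, so {0, 1, 3} form one class.
print(near_duplicate_classes(sim, 0.8))  # [[0, 1, 3], [2]]
```

Note the transitivity caveat: documents 1 and 3 end up in the same class even though their direct similarity is only 0.3.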
Summary
- Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
- Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec System by 20 May 2019.
For more information:
https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam format
- Date: 27 Apr (1pm)
- Venue: SR3 (COM1-02-12)
- Open book
Exam topics
- Anything covered in lecture (through the slides) and the corresponding sections in the textbook.
- For sample questions, refer to the tutorials, the midterm, and past year exam papers.
- If in doubt, ask on the forum.
Exam topic distribution
- Focus on topics covered from Weeks 5 to 12.
- Q1 is true/false questions on various topics (20 marks).
- Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!
REVISION
Week 1: Ngram models
- A language model is created based on a collection of text and used to assign a probability to a sequence of words.
- Unigram LM: bag of words. Ngram LM: use n-1 tokens as the context to predict the nth token.
- Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
- LMs can be used to do information retrieval, as introduced in Week 11.
Unigram LM with add-1 smoothing
- Not used in practice, but the most basic to understand.
- Idea: add 1 count to all entries in the LM, including those that are not seen.
- e.g., assuming |V| = 11 (the total number of word types in both lines):

  Line 1 counts: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (total count 18)
  Line 2 counts: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (total count 20)

- Q2 (by probability): "I don't want"
  P(Aerosmith) = .11 × .11 × .11 = 1.3E-3
  P(LadyGaga) = .15 × .05 × .15 = 1.1E-3
  Winner: Aerosmith
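Add-1 smoothing itself is one line: P(w) = (count(w) + 1) / (N + |V|). A minimal sketch using the first count table above; note the slide's worked figures appear to round each factor before multiplying, so its intermediate decimals need not match the exact fractions computed here.

```python
# Sketch of add-1 (Laplace) smoothing for a unigram LM:
#   P(w) = (count(w) + 1) / (total_count + |V|)
def add_one_lm(counts, vocab_size):
    total = sum(counts.values())
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab_size)

line1 = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
         "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
p = add_one_lm(line1, vocab_size=11)   # total count 18, |V| = 11

print(p("i"))      # seen word:   (2+1)/(18+11) = 3/29
print(p("miss"))   # unseen word: (0+1)/(18+11) = 1/29, never zero
```

The point of the +1 is visible in the second line: an unseen word still gets nonzero probability, so a product over a phrase never collapses to 0.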
Week 2: Basic (Boolean) IR
- Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings.
- Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
Ch 1
Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged.
- Split into Dictionary and Postings.
- Document frequency information is also stored. Why frequency?
Sec 1.2
The merge
- Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
  Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
  Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  Brutus AND Caesar: 2 → 8
- If the list lengths are x and y, the merge takes O(x+y) operations.
- Crucial: postings must be sorted by docID.
Sec 1.3
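The linear merge above can be sketched directly, using the Brutus/Caesar lists from the slide:

```python
# Sketch of the linear-time postings intersection ("merge").
def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):   # O(x + y) total steps
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                       # advance the pointer on the
        else:                            # smaller docID
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

The sorted-by-docID requirement is exactly what lets a single forward pass suffice.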
Week 3: Postings lists and choosing terms
- Faster merging of postings lists: skip pointers.
- Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
- Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch 2
Adding skip pointers to postings
- Done at indexing time.
- Why? To skip postings that will not figure in the search results.
- How to do it? And where do we place skip pointers?
  2 → 4 → 8 → 41 → 48 → 64 → 128 (skip pointers 2 → 41 and 41 → 128)
  1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 (skip pointers 1 → 11 and 11 → 31)
Sec 2.3
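A sketch of intersection with skip pointers, under the common heuristic (an assumption here, not stated on this slide) of placing √L evenly spaced skips on a list of length L:

```python
# Sketch: postings intersection that follows a skip pointer whenever the
# skip target is still <= the candidate docID on the other list.
import math

def with_skips(postings):
    """skips[i] = index i can jump to, or None if i has no skip pointer."""
    step = math.isqrt(len(postings)) or 1
    return [i + step if (i % step == 0 and i + step < len(postings)) else None
            for i in range(len(postings))]

def intersect_with_skips(p1, p2):
    s1, s2 = with_skips(p1), with_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            if s1[i] is not None and p1[s1[i]] <= p2[j]:
                i = s1[i]            # skip target <= other docID: safe jump
            else:
                i += 1
        else:
            if s2[j] is not None and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```

Jumping only when the skip target is still ≤ the other list's docID guarantees no answer is skipped over.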
Biword indexes
- Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
- For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
- Each of these biwords is now a dictionary term.
- Two-word phrase query processing is now immediate.
Sec 2.4.1
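Generating the biword dictionary terms for the example is a one-liner over adjacent token pairs:

```python
# Sketch: biword generation (normalization here is just lowercasing and
# stripping punctuation, as an illustrative assumption).
def biwords(text):
    toks = [t.strip(",.").lower() for t in text.split()]
    return [f"{a} {b}" for a, b in zip(toks, toks[1:])]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```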
Positional index example
  <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>
- For phrase queries, we use a merge algorithm recursively, at the document level.
- Now we need to deal with more than just equality.
- Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 2.4.2
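The "more than just equality" part can be sketched for a two-term phrase: after matching docIDs, check whether some position of the second term is exactly one past a position of the first. The "be" postings below are from the example; the "to" postings are invented for illustration (the slide does not give them).

```python
# Sketch: two-term phrase matching over positional postings.
def phrase_docs(pos1, pos2):
    """Docs where some position p of term1 has p+1 among term2's positions."""
    hits = []
    for doc in pos1.keys() & pos2.keys():     # docID-level merge
        p2set = set(pos2[doc])
        if any(p + 1 in p2set for p in pos1[doc]):
            hits.append(doc)
    return sorted(hits)

to = {1: [6, 30], 4: [190, 429, 433], 5: [100]}            # hypothetical
be = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
      4: [17, 191, 291, 430, 434], 5: [363, 367]}          # from the slide
print(phrase_docs(to, be))  # [1, 4]
```

Proximity queries generalize this by replacing `p + 1 in p2set` with a window test |p2 − p| ≤ k.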
Choosing terms
- Definitely language specific!
- In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
- Data structures for the dictionary: hashes, trees.
- Learning to be tolerant:
  1. Wildcards: general trees, permuterm, n-grams redux
  2. Spelling correction: edit distance, n-grams re-redux
  3. Phonetic: Soundex
Hash table
- Each vocabulary term is hashed to an integer.
- Pros: lookup is faster than for a tree: O(1).
- Cons (not very tolerant):
  - No easy way to find minor variants: judgment/judgement
  - No prefix search.
  - If the vocabulary keeps growing, we occasionally need to do the expensive operation of rehashing everything.
Sec 3.1
Wildcard queries
- How can we handle *'s in the middle of a query term? E.g., co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
- The solution: transform wildcard queries so that the * always occurs at the end.
- This gives rise to the Permuterm index.
- Alternative: k-gram index with postfiltering.
Sec 3.2
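The permuterm trick can be sketched in a few lines: index every rotation of term+'$', and rotate the query so the * lands at the end, turning the wildcard into a prefix search. The four dictionary terms are made up; a real permuterm index would sit in a B-tree rather than a linear scan.

```python
# Sketch of a permuterm index for single-'*' wildcard queries.
def rotations(term):
    t = term + "$"                       # '$' marks the end of the term
    return {t[i:] + t[:i] for i in range(len(t))}

index = {}                               # rotation -> original term
for term in ["concoction", "cotton", "caption", "cognition"]:
    for r in rotations(term):
        index[r] = term

def wildcard(query):
    pre, post = query.split("*")         # co*tion -> pre='co', post='tion'
    key = post + "$" + pre               # rotate so '*' is at the end
    return sorted({t for r, t in index.items() if r.startswith(key)})

print(wildcard("co*tion"))  # ['cognition', 'concoction']
```

"co*tion" becomes the prefix query tion$co*, which matches the rotations tion$cogni… and tion$concoc… but not tion$cap….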
Spelling correction
- Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
- How do we define "closest"? We studied several alternatives:
  1. Edit distance (Levenshtein distance)
  2. Weighted edit distance
  3. n-gram overlap
- Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
Sec 3.3.2
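The first alternative, unweighted Levenshtein distance, is a standard dynamic program; a minimal two-row sketch:

```python
# Sketch: Levenshtein edit distance via dynamic programming,
# keeping only the previous row of the DP table.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))            # distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution / copy
        prev = cur
    return prev[-1]

print(edit_distance("judgement", "judgment"))  # 1 (delete one 'e')
print(edit_distance("cat", "dog"))             # 3
```

Weighted edit distance replaces the three `+ 1` costs with per-operation (or per-character-pair) weights.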
Week 5: Index construction
- Sort-based indexing: Blocked Sort-Based Indexing (BSBI). Merge sort is effective for disk-based sorting (it avoids seeks).
- Single-Pass In-Memory Indexing (SPIMI): no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
- Distributed indexing using MapReduce.
- Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked Sort-Based Indexing (sorting with fewer disk seeks)
- 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs.
- These are generated as we parse docs.
- Must now sort 100M 8-byte records by termID.
- Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
- Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec 4.2
SPIMI: Single-Pass In-Memory Indexing
- Key idea 1: generate separate dictionaries for each block – no need to maintain the term-termID mapping across blocks.
- Key idea 2: don't sort; accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indices can then be merged into one big index.
Sec 4.3
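The per-block inversion at the heart of SPIMI can be sketched as follows; the token stream is a made-up illustration, and deduplication assumes docIDs arrive in nondecreasing order (document-at-a-time parsing):

```python
# Sketch of SPIMI's per-block inversion: a fresh dictionary per block,
# with postings appended in arrival order rather than sorted.
from collections import defaultdict

def spimi_invert(token_stream):
    """token_stream: iterable of (term, docID) pairs for one block."""
    dictionary = defaultdict(list)
    for term, doc_id in token_stream:
        postings = dictionary[term]          # new term -> new postings list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)          # dedupe repeats within a doc
    return dict(sorted(dictionary.items()))  # sort terms once, at block end

block = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 2),
         ("calpurnia", 2), ("caesar", 2)]
print(spimi_invert(block))
# {'brutus': [1, 2], 'caesar': [1, 2], 'calpurnia': [2]}
```

Only the (much smaller) term list is sorted, once per block, when the block is written out; the postings themselves are never sorted.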
Distributed indexing: MapReduce data flow
[Figure: the master assigns segment files (splits) to parsers in the map phase; each parser writes (term, docID) pairs into term partitions a-b, c-d, …, y-z; in the reduce phase, one inverter per partition collects all its pairs and writes the postings.]
Dynamic indexing: 2nd simplest approach
- Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
- Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this invalidation bit-vector.
- Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
- Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
Logarithmic merge
- Idea: maintain a series of indexes, each twice as large as the previous one.
- Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
- If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
- Either write merge Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2.
- … and so on, looping up the log levels.
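The cascade up the levels can be sketched with a dict of on-disk generations. This is a toy model (postings are bare docIDs, n = 4, and "disk" is a dict), but the merge pattern is the one described above: an overflowing Z0 keeps merging with the existing index at each generation until it finds a free slot.

```python
# Sketch of logarithmic merging: generation g holds an index built from
# ~n * 2**g postings; an overflowing in-memory Z0 cascades upward.
def add_posting(disk, memory, posting, n=4):
    """disk: dict generation -> sorted postings; memory: the Z0 buffer."""
    memory.append(posting)
    if len(memory) > n:                  # Z0 too big: flush and cascade
        z = sorted(memory)
        memory.clear()
        gen = 0
        while gen in disk:               # I_gen exists: merge, go up a level
            z = sorted(z + disk.pop(gen))
            gen += 1
        disk[gen] = z                    # first free generation takes Z

disk, memory = {}, []
for doc_id in range(1, 14):
    add_posting(disk, memory, doc_id, n=4)
print(disk, memory)
# {1: [1, ..., 10]} [11, 12, 13]  -- two flushes merged into generation 1
```

Each posting participates in at most one merge per generation, hence the "up to log T merges per posting" bound on the slide.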
Week 6: Index compression
- Collection and vocabulary statistics: Heaps' and Zipf's laws.
- Compression to make the index smaller and faster.
- Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
- Postings compression: gap encoding.
Index compression: dictionary-as-a-string and blocking
- Store pointers to every kth term string; example below: k = 4.
- Need to store term lengths (1 extra byte):
  …7systile9syzygetic8syzygial6syzygy11szaibelyite…
- Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec 5.2
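The byte accounting on this slide generalizes to any block size. A minimal sketch, assuming 3-byte term pointers as the slide's "save 9 bytes on 3 pointers" implies:

```python
# Sketch: net bytes saved per dictionary by blocking with block size k.
# Blocking keeps one string pointer per block (dropping k-1 per full
# block) at the cost of a 1-byte length prefix per term.
def blocking_savings(num_terms, k, ptr_bytes=3, len_bytes=1):
    blocks = -(-num_terms // k)                # ceiling division
    saved = (num_terms - blocks) * ptr_bytes   # pointers no longer stored
    lost = num_terms * len_bytes               # added term-length bytes
    return saved - lost

# One block of k = 4 terms, as in the example: save 9, lose 4 -> net 5.
print(blocking_savings(4, 4))  # 5
```

Larger k saves more pointer bytes but slows term lookup, since more of each block must be scanned linearly.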
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Without search engines the web wouldnrsquot work Without search content is hard to find rarr Without search there is no incentive to create
content Why publish something if nobody will read it Why publish something if I donrsquot get ad revenue from
it Somebody needs to pay for the web Servers web infrastructure content creation A large part today is paid by search ads Search pays
for the web6
Ch 19
CS3245 ndash Information Retrieval
Interest aggregation Unique feature of the web A small number of
geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into
relatively portable C (open source project) Search engines are a key enabler for interest
aggregation
7
Ch 19
CS3245 ndash Information Retrieval
SEARCHADVERTISING
8
Sec 193
CS3245 ndash Information Retrieval
1st Generation of Search AdsGoto (1996)
9
Sec 193
CS3245 ndash Information Retrieval
1st Generation of search adsGoto (1996)
Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the
link. Pages were simply ranked according to bid – revenue maximization for Goto. No separation of ads/docs, only one result list. Upfront and honest: no relevance ranking, but Goto did not pretend there was any.
Sec 193
2nd generation of search ads: Google (2000)
(Figure: SogoTrade appears in search results, and SogoTrade appears in ads.)
Do search engines rank advertisers higher than non-advertisers? All major search engines claim "no".
Sec 193
Do ads influence editorial content? Similar problem at newspapers and TV channels: a newspaper is reluctant to publish harsh criticism of its major advertisers, so the line often gets blurred at newspapers and on TV. No known case of this happening with search engines yet.
Leads to the jobs of white and black hat search engine optimization (organic) and search engine marketing (paid).
Sec 193
How are ads ranked? Advertisers bid for keywords – sale by auction. Open system: anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on their ad (i.e., cost per click, CPC).
How does the auction determine an ad's rank and the price paid for the ad? The basis is a second price auction, but with twists.
For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.
Sec 193
How are ads ranked? First cut: according to bid price, à la Goto. Bad idea: open to abuse. Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
Instead, rank based on bid price and relevance. Key measure of ad relevance: click-through rate (CTR) = clicks per impression. Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term. Hope: overall acceptance of the system and overall revenue are maximized if users get useful information.
Sec 193
Google's second price auction:
bid: maximum bid for a click by the advertiser.
CTR: click-through rate – when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
ad rank: bid × CTR; this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is.
rank: rank in the auction.
paid: second price auction price paid by the advertiser.
Sec 193
Google's second price auction:
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)
price1 = bid2 × CTR2 / CTR1
p1 = bid2 × CTR2/CTR1 = $3.00 × 0.03/0.06 = $1.50
p2 = bid3 × CTR3/CTR2 = $1.00 × 0.08/0.03 = $2.67
p3 = bid4 × CTR4/CTR3 = $4.00 × 0.01/0.08 = $0.50
Sec 193
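As a sketch of the pricing rule above (function and variable names are my own; the top ad's bid of $2.00 is not shown in the transcript and is assumed here, and the 1-cent increment is omitted):

```python
# Generalized second-price auction: rank ads by ad rank = bid * CTR;
# each ad pays the minimum bid that keeps it above the next-ranked ad:
#   price_i = bid_{i+1} * CTR_{i+1} / CTR_i
# (the last-ranked ad pays a reserve price, shown here as None).

def gsp_prices(ads):
    """ads: list of (bid, ctr) pairs. Returns (bid, ctr, price) triples
    sorted by descending ad rank."""
    ranked = sorted(ads, key=lambda a: a[0] * a[1], reverse=True)
    result = []
    for i, (bid, ctr) in enumerate(ranked):
        if i + 1 < len(ranked):
            next_bid, next_ctr = ranked[i + 1]
            result.append((bid, ctr, round(next_bid * next_ctr / ctr, 2)))
        else:
            result.append((bid, ctr, None))  # reserve price applies
    return result

# The slide's numbers: bids $2, $3, $1, $4 with CTRs 0.06, 0.03, 0.08, 0.01.
prices = gsp_prices([(2.00, 0.06), (3.00, 0.03), (1.00, 0.08), (4.00, 0.01)])
# -> prices paid: $1.50, $2.67, $0.50 for the top three ads
```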
Keywords with high bids. Top 10 most expensive Google keywords: insurance, loans, mortgage, attorney, credit, lawyer, donate, degree, hosting, claim.
Sec 193
http://www.wordstream.com/articles/most-expensive-keywords
Search ads: a win-win-win. The search engine company gets revenue every time somebody clicks on an ad. The user only clicks on an ad if they are interested in the ad; search engines punish misleading and nonrelevant ads, so users are often satisfied with what they find after clicking on an ad. The advertiser finds new customers in a cost-effective way.
Sec 193
Not a win-win-win: keyword arbitrage. Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google, e.g., redirect to a page full of ads. This rarely makes sense for the user.
(Ad) spammers keep inventing new tricks, and the search engines need time to catch up with them: Adversarial Information Retrieval.
Sec 193
Not a win-win-win: violation of trademarks, e.g. geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.
http://support.google.com/adwordspolicy/answer/6118?rd=1: "It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
Sec 193
DUPLICATE DETECTION
Sec 196
Duplicate detection. The web is full of duplicated content – more so than many other collections.
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5).
Near-duplicates: more common on the web, and difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents. Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
Sec 196
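A minimal sketch of exact-duplicate detection by fingerprinting, as described above (the grouping helper is my own illustration):

```python
import hashlib

def fingerprint(text):
    """MD5 fingerprint of a document's raw text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def exact_duplicate_groups(docs):
    """docs: {doc_id: text}. Returns groups of doc_ids whose texts
    are byte-identical (same fingerprint)."""
    by_fp = {}
    for doc_id, text in docs.items():
        by_fp.setdefault(fingerprint(text), []).append(doc_id)
    return [ids for ids in by_fp.values() if len(ids) > 1]

groups = exact_duplicate_groups({1: "to be or not to be",
                                 2: "to be or not to be",
                                 3: "something else entirely"})
# -> [[1, 2]]
```

Note that any whitespace or markup difference changes the hash, which is why this only catches exact duplicates.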
Near-duplicates: example. (Figure omitted.)
Sec 196
Detecting near-duplicates: compute similarity with an edit-distance measure. We want "syntactic" (as opposed to semantic) similarity; true semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is/isn't a near-duplicate"; e.g., two documents are near-duplicates if similarity > θ = 80%.
Sec 196
Recall: the Jaccard coefficient, a commonly used measure of the overlap of two sets. Let A and B be two sets. Jaccard coefficient:
J(A, B) = |A ∩ B| / |A ∪ B|
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1.
Sec 196
Jaccard coefficient example: three documents.
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.
Sec 196
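The example above can be checked directly; a sketch (helper names are mine):

```python
def shingles(text, n=2):
    """The set of word n-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
d3 = "Jack traveled from Oakland to London"

# jaccard(shingles(d1), shingles(d2)) -> 0.375, jaccard(shingles(d1), shingles(d3)) -> 0.0
```

d1 has 4 bigrams and d2 has 7, sharing 3 of them, hence 3 / (4 + 7 − 3) = 3/8; d3 shares none of d1's bigrams even though it reuses all its words.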
A document as a set of shingles. A shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents. For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
Sec 196
Fingerprinting. We can map shingles to a large integer space [1 .. 2^m] (e.g., m = 64) by fingerprinting. We use s_k to refer to the shingle's fingerprint in [1 .. 2^m]. This doesn't directly help us – we are just converting strings to large integers. But it will help us compute an approximation to the actual Jaccard coefficient quickly.
Sec 196
Documents as sketches. The number of shingles per document is large, so it is difficult to compare documents exhaustively. To make it fast, we use a sketch, a subset of the shingles of a document. The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1, ..., π200. Each πi is a random permutation on [1 .. 2^m]. The sketch of d is defined as
⟨ min_{s∈d} π1(s), min_{s∈d} π2(s), ..., min_{s∈d} π200(s) ⟩
(a vector of 200 numbers).
Sec 196
Deriving a sketch element: a permutation of the original hashes. (Figure: document 1 and document 2, each reduced to one sketch value under π.) We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.
Sec 196
Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j):
Let C00 = # of rows in A where both entries are 0; define C11, C10, and C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11).
P(x_i = x_j) is then also equivalent to C11 / (C10 + C01 + C11).
(Figure: a 0/1 matrix A with one column per set, S_i and S_j, shown under several random row permutations π(1), π(2), π(3), ..., π(n = 200); some permutations give x_i ≠ x_j, others x_i = x_j. Here J(S_i, S_j) = 2/5.)
We view a matrix A with 1 column per set of hashes. Element A_ij = 1 if element i in set S_j is present. Permutation π(n) is a random reordering of the rows of A. x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.
Sec 196
Estimating Jaccard. Thus the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s). Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200). Our sketch is based on a random selection of permutations. Thus, to estimate the Jaccard coefficient, count the number k of successful permutations for ⟨d1, d2⟩ and divide by n = 200: k/n = k/200 estimates J(d1, d2).
Sec 196
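A toy sketch of this estimate (it substitutes random linear hash functions for true random permutations, a standard practical approximation; all names and the toy sets are mine):

```python
import random

random.seed(0)
P = (1 << 61) - 1  # a large prime; h(x) = (a*x + b) mod P stands in for a permutation
PERMS = [(random.randrange(1, P), random.randrange(P)) for _ in range(200)]

def sketch(shingle_ids):
    """MinHash sketch: the minimum hash value per 'permutation'."""
    return [min((a * s + b) % P for s in shingle_ids) for a, b in PERMS]

def estimate_jaccard(sk1, sk2):
    """Fraction of sketch positions that agree ~ true Jaccard."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

# Toy shingle-ID sets with true Jaccard = 50 / 200 = 0.25:
A = set(range(0, 100))
B = set(range(50, 200))
est = estimate_jaccard(sketch(A), sketch(B))  # close to 0.25
```

With 200 trials the standard error of the estimate is about sqrt(0.25 × 0.75 / 200) ≈ 0.03, so the estimate lands near the true value of 0.25.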
Shingling summary. Input: N documents. Choose an n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash functions. Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document. Compute pairwise similarities. Take the transitive closure of documents with similarity > θ. Index only one document from each equivalence class.
Sec 196
Summary.
Search Advertising: a type of crowdsourcing – ask advertisers how much they want to spend, and ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec system by 20 May 2019. For more information:
https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics: anything covered in lecture (through the slides) and the corresponding sections in the textbook. For sample questions, refer to the tutorials, the midterm, and past year exam papers. If in doubt, ask on the forum.
Exam topic distribution: the focus is on topics covered from Weeks 5 to 12. Q1 is true/false questions on various topics (20 marks). Q2–Q9 are calculation/essay questions, each topic-specific (8–12 marks each). Don't forget your calculator!
REVISION
Week 1: Ngram Models. A language model is created from a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use n−1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can also be used to do information retrieval, as introduced in Week 11.
Unigram LM with add-1 smoothing. Not used in practice, but the most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.
E.g., assuming |V| = 11 (the total # of word types in both lines):
Aerosmith line counts: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (total count 18).
Lady Gaga line counts: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (total count 20).
Q2 (by probability): "I don't want"
P(Aerosmith) = 3/29 × 3/29 × 3/29 ≈ 1.11E-3
P(LadyGaga) = 4/31 × 2/31 × 4/31 ≈ 1.07E-3
Winner: Aerosmith
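The add-1 computation above, as a sketch (function names are mine):

```python
def add_one_prob(counts, total, V, token):
    """Add-1 smoothed unigram probability: (count + 1) / (total + |V|)."""
    return (counts.get(token, 0) + 1) / (total + V)

aerosmith = {"i": 2, "dont": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
ladygaga = {"i": 3, "dont": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}
V = 11  # word types across both lines

def query_prob(counts, tokens):
    """Probability of a token sequence under the smoothed unigram LM."""
    total = sum(counts.values())
    p = 1.0
    for t in tokens:
        p *= add_one_prob(counts, total, V, t)
    return p

p_aero = query_prob(aerosmith, ["i", "dont", "want"])  # (3/29)^3
p_gaga = query_prob(ladygaga, ["i", "dont", "want"])   # 4/31 * 2/31 * 4/31
# p_aero > p_gaga, so Aerosmith wins.
```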
Week 2: Basic (Boolean) IR. Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings. Boolean query processing: intersection by linear-time "merging", with simple optimizations by expected size.
Ch 1
Indexer steps: Dictionary & Postings. Multiple term entries in a single document are merged. Split into Dictionary and Postings. Document frequency information is also stored. (Why frequency?)
Sec 12
The merge: walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8
If the list lengths are x and y, the merge takes O(x + y) operations. Crucial: postings must be sorted by docID.
Sec 13
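The linear-time merge, as a sketch in code:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(x + y) time by
    advancing a pointer into each list."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
# intersect(brutus, caesar) -> [2, 8]
```

The sorted order is what lets each comparison discard one posting for good; unsorted lists would force a quadratic scan.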
Week 3: Postings lists and choosing terms. Faster merging of postings lists: skip pointers. Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries. Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch 2
Adding skip pointers to postings. Done at indexing time. Why? To skip postings that will not figure in the search results. How do we do it, and where do we place the skip pointers?
(Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.)
Sec 23
Biword indexes: index every consecutive pair of terms in the text as a phrase – a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen". Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
Sec 241
Positional index example:
⟨be: 993427; 1: ⟨7, 18, 33, 72, 86, 231⟩; 2: ⟨3, 149⟩; 4: ⟨17, 191, 291, 430, 434⟩; 5: ⟨363, 367⟩; ...⟩
For phrase queries, we use a merge algorithm recursively at the document level. We now need to deal with more than just equality.
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 242
Choosing terms is definitely language-specific. In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval.
Data structures for the dictionary: hashes and trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, Ngrams redux.
2. Spelling correction: edit distance, Ngrams re-redux.
3. Phonetic: Soundex.
Hash table: each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything. Not very tolerant!
Sec 31
Wildcard queries: how can we handle *'s in the middle of a query term, e.g., co*tion? We could look up co* AND *tion in a B-tree and intersect the two term sets, but that is expensive. The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
Sec 32
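As a sketch of the Permuterm idea (names mine): store every rotation of term + "$", and rewrite a query X*Y as the prefix Y$X, which one rotation of every matching term will begin with.

```python
def permuterm_keys(term):
    """All rotations of term + '$'; in a real index each rotation is a
    dictionary key pointing back to the original term."""
    t = term + "$"
    return {t[i:] + t[:i] for i in range(len(t))}

def rewrite_wildcard(query):
    """Rotate a single-* query X*Y so the * lands at the end:
    co*tion -> prefix search for tion$co."""
    x, y = query.split("*")
    return y + "$" + x

prefix = rewrite_wildcard("co*tion")                                  # "tion$co"
hit = any(k.startswith(prefix) for k in permuterm_keys("coalition"))  # True
```

"coalition" matches co*tion because its rotation "tion$coali" begins with "tion$co"; a term like "cotton" produces no such rotation.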
Spelling correction. Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
Sec 332
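Levenshtein distance by dynamic programming, as a sketch:

```python
def edit_distance(s, t):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

# edit_distance("judgment", "judgement") -> 1 (one inserted 'e')
```

Weighted edit distance simply replaces the unit costs with per-character costs (e.g., cheaper for keys that are adjacent on the keyboard).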
Week 5: Index construction.
Sort-based indexing: Blocked Sort-Based Indexing (BSBI); merge sort is effective for disk-based sorting (avoids seeks).
Single-Pass In-Memory Indexing (SPIMI): no global dictionary – generate a separate dictionary for each block; don't sort postings – accumulate postings as they occur.
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked Sort-Based Indexing (sorting with fewer disk seeks). 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs; these are generated as we parse the docs. Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec 42
SPIMI: Single-Pass In-Memory Indexing.
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-to-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas, we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
Sec 43
Distributed indexing: MapReduce data flow.
(Figure: the input is split into segment files; in the map phase, the master assigns splits to parsers, which emit (term, docID) pairs partitioned into term ranges a-b, c-d, ..., y-z; in the reduce phase, one inverter per partition collects its pairs and writes out the postings.)
Dynamic indexing: 2nd simplest approach. Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results. Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output by a search against this bit-vector. Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 45
Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2, and so on – loop over the log levels.
Sec 45
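A sketch of that cascade, with indexes represented simply as sorted docID lists (all names mine):

```python
def flush(levels, z0):
    """Write the in-memory index z0 to disk, cascade-merging with any
    occupied level: levels[i] plays the role of I_i (None = free)."""
    z = sorted(z0)
    i = 0
    while levels.get(i) is not None:
        z = sorted(set(z) | set(levels[i]))  # merge with the existing I_i
        levels[i] = None                     # I_i is consumed by the merge
        i += 1
    levels[i] = z                            # first free level holds the result
    return levels

levels = {}
flush(levels, [3, 1])   # -> {0: [1, 3]}
flush(levels, [4, 2])   # level 0 occupied: merge up -> {0: None, 1: [1, 2, 3, 4]}
```

Each flush doubles the size of the index it carries upward, which is why any given posting participates in only logarithmically many merges.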
Week 6: Index compression. Collection and vocabulary statistics: Heaps' and Zipf's laws. Compression makes the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding. Postings compression: gap encoding.
Index compression: Dictionary-as-a-String with blocking. Store pointers to every kth term string; in the example below, k = 4. We then need to store term lengths (1 extra byte each):
...7systile9syzygetic8syzygial6syzygy11szaibelyite...
We save 9 bytes on 3 pointers, but lose 4 bytes on term lengths.
Sec 52
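Gap encoding of a postings list, as a sketch (names and the example list are mine):

```python
def to_gaps(postings):
    """Store the first docID, then the differences between consecutive
    docIDs; gaps are small and compress well with variable-length codes."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Invert gap encoding back to absolute docIDs by a running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

# to_gaps([33, 47, 154, 159, 202]) -> [33, 14, 107, 5, 43]
```

This only pays off once the gaps are fed to a code that spends fewer bits on small numbers (e.g., variable-byte or gamma codes); the list of gaps itself is the same length as the original.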
CS3245 ndash Information Retrieval
Interest aggregation Unique feature of the web A small number of
geographically dispersed people with similar interests can find each other Elementary school kids with hemophilia People interested in translating R5R5 Scheme into
relatively portable C (open source project) Search engines are a key enabler for interest
aggregation
7
Ch 19
CS3245 ndash Information Retrieval
SEARCHADVERTISING
8
Sec 193
CS3245 ndash Information Retrieval
1st Generation of Search AdsGoto (1996)
9
Sec 193
CS3245 ndash Information Retrieval
1st Generation of search adsGoto (1996)
Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the
link Pages were simply ranked according to bid ndash revenue
maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking
but Goto did not pretend there was any10
Sec 193
CS3245 ndash Information Retrieval
2nd generation of search ads Google (2000)
11
SogoTrade appearsin searchresults
SogoTrade appearsin ads
Do search enginesrank advertisershigher thannon-advertisers
All major searchengines claim ldquonordquo
Sec 193
CS3245 ndash Information Retrieval
Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major
advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet
Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)
12
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your
ad (ie CPC)
How does the auction determine an adrsquos rank and the price paidfor the ad
Basis is a second price auction but with twists For the bottom line this is perhaps the most important
research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means
billions of additional revenue for the search engine
13
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads
Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =
clicks per impression Result A nonrelevant ad will be ranked low
Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is
maximized if users get useful information
14
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what
percentage of time do users click on it CTR is a measure of relevance
ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank rank in auction paid second price auction price paid by advertiser
15
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction
price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)
price1 = bid2 times CTR2 CTR1
p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050
16
Sec 193
CS3245 ndash Information Retrieval
Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim
17
Sec 193
httpwwwwordstreamcomarticlesmost-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win: violation of trademarks
Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.
http://support.google.com/adwordspolicy/answer/6118?rd=1: "It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
DUPLICATE DETECTION
Sec 19.6
Duplicate detection
The web is full of duplicated content – more so than many other collections.
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5).
Near-duplicates: more common on the web, and difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents. Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
Near-duplicates: example
Detecting near-duplicates
Compute similarity with an edit-distance measure. We want "syntactic" (as opposed to semantic) similarity: true semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is/isn't a near-duplicate". E.g., two documents are near-duplicates if similarity > θ = 80%.
Recall: Jaccard coefficient
A commonly used measure of the overlap of two sets. Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|.
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. It always assigns a number between 0 and 1.
Jaccard coefficient: example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.
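The example above can be checked directly; a minimal sketch:

```python
# Bigram shingles + Jaccard coefficient, reproducing the slide's example.

def shingles(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
d3 = "Jack traveled from Oakland to London"

print(jaccard(shingles(d1), shingles(d2)))  # 0.375
print(jaccard(shingles(d1), shingles(d3)))  # 0.0
```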
A document as a set of shingles
A shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents. For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting. We use s_k to refer to the shingle's fingerprint in [1..2^m].
This doesn't directly help us – we are just converting strings to large integers. But it will help us compute an approximation to the actual Jaccard coefficient quickly.
Documents as sketches
The number of shingles per document is large, making documents difficult to compare exhaustively. To make it fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1 … π200. Each πi is a random permutation on [1..2^m].
The sketch of d is defined as ⟨min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s)⟩ (a vector of 200 numbers).
Deriving a sketch element: apply a permutation π to the original hashes of document 1 and of document 2, and take the minimum on each side.
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.
Sec 19.6
Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
View a matrix A with one column per set of hashes: element A_{rj} = 1 if element r is present in set S_j. Each permutation π is a random reordering of the rows of A, and x_k is the first non-zero entry in the permuted column for document k, i.e., the first shingle present in document k.
Let C00 = the number of rows in A where both entries are 0; define C11, C10, and C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11).
P(x_i = x_j) is likewise equivalent to C11 / (C10 + C01 + C11).
(The slide illustrates this with two columns S_i, S_j for which J(S_i, S_j) = 2/5, and permutations π(1), π(2), π(3), …, π(n = 200), some of which give x_i ≠ x_j and some x_i = x_j.)
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. An estimator of the probability of success is the proportion of successes in n Bernoulli trials (n = 200). Our sketch is based on a random selection of permutations.
Thus, to estimate the Jaccard coefficient, count the number k of successful permutations for ⟨d1, d2⟩ and divide by n = 200: k/n = k/200 estimates J(d1, d2).
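A minimal MinHash sketch of this estimate, assuming (as is common in practice) that the random permutations are simulated with salted hash functions rather than explicit permutations of [1..2^m]:

```python
import random

def sketch(shingle_set, seeds):
    # One sketch element per simulated permutation: the minimum hash value.
    # hash() is salted per process, but is consistent within a run,
    # which is all we need here.
    return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

def estimate_jaccard(sk1, sk2):
    # Proportion of "successful" permutations (equal minima).
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(200)]

d1 = {"jack london", "london traveled", "traveled to", "to oakland"}
d2 = {"jack london", "london traveled", "traveled to", "to the",
      "the city", "city of", "of oakland"}

print(estimate_jaccard(sketch(d1, seeds), sketch(d2, seeds)))
# close to the true Jaccard coefficient 3/8 = 0.375
```

With n = 200 trials the standard error is about 0.034, so the estimate lands near 0.375 but varies a little from run to run.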
Shingling: summary
Input: N documents. Choose an n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash functions. Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities, take the transitive closure of documents with similarity > θ, and index only one document from each equivalence class.
Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec System by 20 May 2019.
For more information:
https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through slides) and the corresponding sections in the textbook.
For sample questions, refer to the tutorials, the midterm, and past years' exam papers.
If in doubt, ask on the forum.
Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2–Q9 are calculation / essay questions, each topic specific (8–12 marks each). Don't forget your calculator!
REVISION
Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use n−1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.
Unigram LM with Add-1 smoothing
Not used in practice, but the most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen; e.g., assuming |V| = 11 (the total number of word types in both lines).
Aerosmith counts: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (total count 18)
Lady Gaga counts: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (total count 20)
Q2 (by probability): "I don't want"
P(Aerosmith) = .11 · .11 · .11 = 1.3E-3
P(LadyGaga) = .15 · .05 · .15 = 1.1E-3
Winner: Aerosmith
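The Q2 computation above uses raw relative frequencies; a small sketch (with an optional add-1 smoothing knob) reproduces it. The two count tables (totals 18 and 20) are the slide's; the Aerosmith / Lady Gaga labels follow from the slide's per-word probabilities (.11 ≈ 2/18, .15 = 3/20).

```python
def prob(query, counts, add=0, V=11):
    # Product of per-word probabilities; add=1 gives add-1 smoothing.
    n = sum(counts.values())
    p = 1.0
    for w in query.split():
        p *= (counts.get(w, 0) + add) / (n + add * V)
    return p

aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}  # N = 18
ladygaga  = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
             "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}  # N = 20

q = "I don't want"
print(prob(q, aerosmith))  # (2/18)^3, about 1.4e-3
print(prob(q, ladygaga))   # (3/20)(1/20)(3/20), about 1.1e-3
```

The slide rounds each Aerosmith factor to .11 before multiplying, hence its 1.3E-3; either way Aerosmith wins, and it still wins with add-1 smoothing applied.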
Week 2: Basic (Boolean) IR
Basic inverted indexes: an in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging", with simple optimizations by expected size.
Ch 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged. Split into Dictionary and Postings. Document frequency information is also stored. (Why frequency?)
Sec 1.2
The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8
If the list lengths are x and y, the merge takes O(x + y) operations. Crucial: postings must be sorted by docID.
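The merge is the standard two-pointer intersection; a minimal sketch:

```python
# Linear-time intersection of two sorted postings lists.

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```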
Sec 1.3
Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch 2
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How do we do it? And where do we place skip pointers?
Example: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers to 41 and 128; 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers to 11 and 31.
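A sketch of intersection with skip pointers, using the common √L heuristic for skip placement (skips stored here as index jumps rather than pointers):

```python
import math

def skips(postings):
    # A skip every floor(sqrt(L)) postings; map start index -> target index.
    step = max(1, int(math.sqrt(len(postings))))
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = skips(p1), skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if it does not overshoot p2[j].
            i = s1[i] if i in s1 and p1[s1[i]] <= p2[j] else i + 1
        else:
            j = s2[j] if j in s2 and p2[s2[j]] <= p1[i] else j + 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```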
Sec 2.3
Biword indexes
Index every consecutive pair of terms in the text as a phrase – a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
Sec 2.4.1
Positional index example
For phrase queries, we use a merge algorithm recursively at the document level. Now we need to deal with more than just equality.
⟨be: 993427; 1: ⟨7, 18, 33, 72, 86, 231⟩; 2: ⟨3, 149⟩; 4: ⟨17, 191, 291, 430, 434⟩; 5: ⟨363, 367⟩; …⟩
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
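A sketch of the positional step for a phrase like "to be": within a candidate document, keep the positions of "be" that sit exactly one past a position of "to". The "be" postings are the slide's; the "to" postings below are made up for illustration.

```python
def phrase_positions(pos_first, pos_second, k=1):
    """Positions of the second word occurring exactly k after the first."""
    starts = set(pos_first)
    return [p for p in pos_second if p - k in starts]

to_postings = {1: [4, 20, 50], 4: [16, 190, 429, 433]}   # hypothetical
be_postings = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
               4: [17, 191, 291, 430, 434], 5: [363, 367]}

for doc in sorted(to_postings.keys() & be_postings.keys()):
    hits = phrase_positions(to_postings[doc], be_postings[doc])
    if hits:
        print(doc, hits)   # with this "to" data, only doc 4 matches
```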
Sec 2.4.2
Choosing terms
Definitely language specific. In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash tables, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, n-grams redux
2. Spelling correction: edit distance, n-grams re-redux
3. Phonetic: Soundex
Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree, O(1).
Cons (not very tolerant): no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.
Sec 3.1
Wildcard queries
How can we handle *'s in the middle of a query term, e.g., co*tion? We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: a k-gram index with postfiltering.
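A sketch of the Permuterm idea: index every rotation of term + "$", rotate the query so the * lands at the end, then match by prefix. The vocabulary here is made up for illustration.

```python
def rotations(term):
    t = term + "$"                      # $ marks the end of the term
    return [t[i:] + t[:i] for i in range(len(t))]

def permuterm_index(vocab):
    return {rot: term for term in vocab for rot in rotations(term)}

def wildcard_lookup(idx, query):
    x, y = query.split("*")             # single-* queries, e.g. "co*tion"
    prefix = y + "$" + x                # rotate so the * is at the end
    return sorted({t for rot, t in idx.items() if rot.startswith(prefix)})

idx = permuterm_index(["cotton", "coagulation", "coordination", "cat"])
print(wildcard_lookup(idx, "co*tion"))  # ['coagulation', 'coordination']
```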
Sec 3.2
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
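The first alternative, Levenshtein edit distance, is the standard dynamic program:

```python
def edit_distance(a, b):
    # d[i][j] = edits needed to turn a[:i] into b[:j]
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("judgment", "judgement"))  # 1
print(edit_distance("cat", "catted"))          # 3
```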
Sec 3.3.2
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (it avoids seeks).
Single-Pass In-Memory Indexing: no global dictionary – generate a separate dictionary for each block; don't sort postings – accumulate postings as they occur.
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs. These are generated as we parse docs.
Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec 4.2
SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain the term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
Sec 4.3
Distributed Indexing: MapReduce data flow
Map phase: parsers read segment files of the input and emit (term, docID) pairs into partitions (a-b, c-d, …, y-z).
Reduce phase: inverters each collect all the pairs for one partition (a-b, c-d, …, y-z) and write out the postings.
A master node assigns tasks to the parser and inverter machines.
Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one (loop for log levels). Keep the smallest (Z0) in memory; keep the larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2 … etc.
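A toy sketch of the logarithmic-merge bookkeeping described above, with disk generations simulated as dicts in a list; the limit n = 4 and the documents are made up for illustration.

```python
N_LIMIT = 4                      # max postings held in the in-memory index Z0

def merge(a, b):
    # Merge two indexes term by term.
    return {t: sorted(set(a.get(t, []) + b.get(t, [])))
            for t in a.keys() | b.keys()}

class LogMergeIndex:
    def __init__(self):
        self.z0, self.size = {}, 0   # in-memory index
        self.disk = []               # disk[i] holds generation I_i, or None

    def add(self, term, doc_id):
        self.z0.setdefault(term, []).append(doc_id)
        self.size += 1
        if self.size > N_LIMIT:
            self._flush()

    def _flush(self):
        z, i = self.z0, 0
        while i < len(self.disk) and self.disk[i] is not None:
            z = merge(z, self.disk[i])   # collision: merge up a generation
            self.disk[i] = None
            i += 1
        if i == len(self.disk):
            self.disk.append(None)
        self.disk[i] = z
        self.z0, self.size = {}, 0

    def postings(self, term):
        docs = set(self.z0.get(term, []))
        for gen in self.disk:
            if gen is not None:
                docs |= set(gen.get(term, []))
        return sorted(docs)

idx = LogMergeIndex()
docs = ["new home sales", "home top forecasts", "rise in home sales",
        "new sales rise"]
for doc_id, text in enumerate(docs):
    for term in text.split():
        idx.add(term, doc_id)
print(idx.postings("sales"))  # [0, 2, 3]
```

Each flush merges a posting up one generation at most, which is where the "each posting is merged up to log T times" bound comes from.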
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster.
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string (example below: k = 4). We need to store the term lengths (1 extra byte each).
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
We save 9 bytes on 3 pointers, and lose 4 bytes on term lengths.
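Decoding one block of the length-prefixed string above (a sketch; note the length prefix can be more than one digit, as in 11szaibelyite):

```python
def read_block(s, start=0, k=4):
    """Decode up to k <length><chars> terms from a blocked dictionary string."""
    terms, pos = [], start
    for _ in range(k):
        digits = ""
        while pos < len(s) and s[pos].isdigit():
            digits += s[pos]
            pos += 1
        if not digits:
            break
        length = int(digits)
        terms.append(s[pos:pos + length])
        pos += length
    return terms

block = "7systile9syzygetic8syzygial6syzygy11szaibelyite"
print(read_block(block))  # ['systile', 'syzygetic', 'syzygial', 'syzygy']
```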
Sec 5.2
Googlersquos second price auction
bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what
percentage of time do users click on it CTR is a measure of relevance
ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank rank in auction paid second price auction price paid by advertiser
15
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction
price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)
price1 = bid2 times CTR2 CTR1
p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050
16
Sec 193
CS3245 ndash Information Retrieval
Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim
17
Sec 193
httpwwwwordstreamcomarticlesmost-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
CS3245 – Information Retrieval
1st Generation of Search Ads: Goto (1996)
9
Sec 19.3
1st Generation of Search Ads: Goto (1996)
Buddy Blake bid the maximum ($0.38) for this search. He paid $0.38 to Goto every time somebody clicked on the link. Pages were simply ranked according to bid – revenue maximization for Goto. No separation of ads/docs; only one result list. Upfront and honest: no relevance ranking, but Goto did not pretend there was any.
10
Sec 19.3
2nd Generation of Search Ads: Google (2000)
11
SogoTrade appears in search results. SogoTrade appears in ads.
Do search engines rank advertisers higher than non-advertisers? All major search engines claim "no".
Sec 19.3
Do ads influence editorial content? Similar problem at newspapers and TV channels: a newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers and on TV. No known case of this happening with search engines yet.
Leads to the jobs of white- and black-hat search engine optimization (organic) and search engine marketing (paid).
12
Sec 19.3
How are ads ranked? Advertisers bid for keywords – sale by auction. Open system: anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on their ad (i.e., CPC, cost per click).
How does the auction determine an ad's rank and the price paid for the ad?
The basis is a second price auction, but with twists. For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.
13
Sec 19.3
How are ads ranked? First cut: according to bid price, à la Goto. Bad idea: open to abuse. Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
Instead: rank based on bid price and relevance. Key measure of ad relevance: click-through rate (CTR) = clicks per impression. Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term. Hope: overall acceptance of the system and overall revenue are maximized if users get useful information.
14
Sec 19.3
Google's second price auction
bid: maximum bid for a click by the advertiser
CTR: click-through rate – when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
ad rank: bid × CTR; this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank: rank in the auction
paid: second price auction price paid by the advertiser
15
Sec 19.3
Google's second price auction
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)
price1 = bid2 × CTR2 / CTR1
p1 = bid2 × CTR2/CTR1 = 3.00 × 0.03/0.06 = 1.50
p2 = bid3 × CTR3/CTR2 = 1.00 × 0.08/0.03 = 2.67
p3 = bid4 × CTR4/CTR3 = 4.00 × 0.01/0.08 = 0.50
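As a sketch (not from the slides), the pricing rule above can be computed directly. The bids/CTRs are the by-rank values implied by the worked numbers; the top-ranked ad's own bid (2.00 here) is an assumed value taken from the textbook's version of this example, since it does not appear on the slide.

```python
# Sketch of the second-price rule: the ad at position i pays
# bid[i+1] * ctr[i+1] / ctr[i] -- the minimum needed to keep its rank.
# Lists are ordered by ad rank (= bid * CTR), as on the slide.
def second_price(bids, ctrs):
    prices = []
    for i in range(len(bids) - 1):
        prices.append(round(bids[i + 1] * ctrs[i + 1] / ctrs[i], 2))
    return prices

bids = [2.00, 3.00, 1.00, 4.00]   # bid1..bid4 by rank (bid1 assumed)
ctrs = [0.06, 0.03, 0.08, 0.01]   # CTR1..CTR4 by rank
print(second_price(bids, ctrs))   # [1.5, 2.67, 0.5]
```

This reproduces p1 = 1.50, p2 = 2.67, p3 = 0.50 from the slide.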
16
Sec 19.3
Keywords with high bids: the top 10 most expensive Google keywords – Insurance, Loans, Mortgage, Attorney, Credit, Lawyer, Donate, Degree, Hosting, Claim.
17
Sec 19.3
http://www.wordstream.com/articles/most-expensive-keywords
Search ads: a win-win-win. The search engine company gets revenue every time somebody clicks on an ad. The user only clicks on an ad if they are interested in the ad; search engines punish misleading and nonrelevant ads, so users are often satisfied with what they find after clicking on an ad. The advertiser finds new customers in a cost-effective way.
18
Sec 19.3
Not a win-win-win: keyword arbitrage. Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google – e.g., redirect to a page full of ads. This rarely makes sense for the user.
(Ad) spammers keep inventing new tricks; the search engines need time to catch up with them – adversarial information retrieval.
19
Sec 19.3
Not a win-win-win: violation of trademarks. Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.
https://support.google.com/adwordspolicy/answer/6118?rd=1: "It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
20
Sec 19.3
DUPLICATE DETECTION
21
Sec 19.6
Duplicate detection: the web is full of duplicated content, more so than many other collections.
Exact duplicates: easy to detect – use a hash/fingerprint (e.g., MD5).
Near-duplicates: more common on the web, and difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents. Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
22
Sec 19.6
Near-duplicates: Example
23
Sec 19.6
Detecting near-duplicates: compute similarity with an edit-distance measure. We want "syntactic" (as opposed to semantic) similarity; true semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is/isn't a near-duplicate" – e.g., two documents are near-duplicates if similarity > θ = 80%.
24
Sec 19.6
Recall the Jaccard coefficient, a commonly used measure of the overlap of two sets. Let A and B be two sets. Jaccard coefficient:
JACCARD(A, B) = |A ∩ B| / |A ∪ B|
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1.
25
Sec 19.6
Jaccard coefficient: Example. Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.
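A minimal sketch (illustration only, not from the slides) that reproduces this example with word-bigram shingles:

```python
# Jaccard coefficient over word-bigram shingles, for the d1/d2/d3 example.
def shingles(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # |A intersect B| / |A union B|
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2))  # 0.375  (= 3/8)
print(jaccard(d1, d3))  # 0.0
```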
26
Sec 19.6
A document as a set of shingles: a shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents. For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: {a-rose-is, rose-is-a, is-a-rose}. We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
27
Sec 19.6
Fingerprinting: we can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting. We use s_k to refer to the shingle's fingerprint in [1..2^m]. This doesn't directly help us – we are just converting strings to large integers. But it will help us compute an approximation to the actual Jaccard coefficient quickly.
28
Sec 19.6
Documents as sketches: the number of shingles per document is large, so documents are difficult to compare exhaustively. To make it fast, we use a sketch: a subset of the shingles of a document. The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1 … π200. Each πi is a random permutation on [1..2^m]. The sketch of d is defined as:
< min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) >
(a vector of 200 numbers).
29
Sec 19.6
30
Deriving a sketch element: a permutation of the original hashes. (Figure: the hash sets of document 1 and document 2, each reduced to its minimum under the permutation.)
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.
Sec 19.6
Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
We view a matrix A with one column per set of hashes; element A_ij = 1 if element i is present in set S_j. Permutation π(n) is a random reordering of the rows in A; x_k is the first non-zero entry in π's ordering of column k, i.e., the first shingle present in document k.
Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11); P(x_i = x_j) is also equivalent to C11 / (C10 + C01 + C11).
(Figure: an example matrix A over columns S_i and S_j with J(S_i, S_j) = 2/5, and its rows under random permutations π(1), π(2), π(3), …, π(n=200); under some permutations x_i ≠ x_j, under others x_i = x_j.)
31
Sec 19.6
Estimating Jaccard: thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s). Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200). Our sketch is based on a random selection of permutations. Thus, to estimate Jaccard, count the number k of successful permutations for <d1, d2> and divide by n = 200: k/n = k/200 estimates J(d1, d2).
32
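A sketch of the whole MinHash pipeline (illustration only, not the slides' implementation). As is standard practice, each "random permutation" is simulated with a random hash function h(x) = (a·x + b) mod p; `zlib.crc32` stands in for the shingle fingerprinting step.

```python
# MinHash estimation of Jaccard with n = 200 simulated permutations.
import random
import zlib

def shingle_set(text, n=2):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_sketch(shingles, hash_params, p=(1 << 61) - 1):
    # fingerprint shingles to integers, then take the min under each "permutation"
    ids = [zlib.crc32(" ".join(s).encode()) for s in shingles]
    return [min((a * x + b) % p for x in ids) for a, b in hash_params]

def estimate_jaccard(sk1, sk2):
    # proportion of "successful permutations": positions where the minima agree
    return sum(m1 == m2 for m1, m2 in zip(sk1, sk2)) / len(sk1)

random.seed(42)
P = (1 << 61) - 1
params = [(random.randrange(1, P), random.randrange(P)) for _ in range(200)]
sk1 = minhash_sketch(shingle_set("Jack London traveled to Oakland"), params)
sk2 = minhash_sketch(shingle_set("Jack London traveled to the city of Oakland"), params)
est = estimate_jaccard(sk1, sk2)
print(est)  # close to the true J(d1, d2) = 0.375
```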
Sec 19.6
Shingling summary. Input: N documents. Choose the n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash functions. Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document. Compute pairwise similarities. Take the transitive closure of documents with similarity > θ. Index only one document from each equivalence class.
33
Sec 19.6
Summary. Search advertising: a type of crowdsourcing – ask advertisers how much they want to spend, ask searchers how relevant an ad is; auction mechanics. Duplicate detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
34
EXAM MATTERS
35
Sec 19.6
36
Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec system by 20 May 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
37
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam topics: anything covered in lecture (through the slides) and the corresponding sections in the textbook. For sample questions, refer to the tutorials, the midterm, and past-year exam papers. If in doubt, ask on the forum.
38
Exam topic distribution: focus on topics covered from Weeks 5 to 12. Q1 is true/false questions on various topics (20 marks). Q2–Q9 are calculation/essay questions, topic-specific (8–12 marks each). Don't forget your calculator!
39
40
REVISION
Week 1: N-gram models. A language model is created based on a collection of text and used to assign a probability to a sequence of words. Unigram LM: bag of words. N-gram LM: use n−1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space. LMs can be used to do information retrieval, as introduced in Week 11.
41
Unigram LM with add-1 smoothing: not used in practice, but the most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.
e.g., assuming |V| = 11:
Q2 (by probability): "I don't want"
P(Aerosmith) = 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith
42
Counts (line 1, Aerosmith): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1; total count 18.
Counts (line 2, Lady Gaga): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2; total count 20.
|V| = total # of word types in both lines.
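A sketch of the add-1 computation with the count tables above (illustration only; the slide's displayed per-word values are rounded, so the exact products differ slightly from 1.3E-3 / 1.1E-3, but the winner is the same):

```python
# Add-1 (Laplace) smoothed unigram probabilities for "I don't want".
aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
ladygaga = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}
V = 11  # number of word types across both lines

def add1_prob(counts, query_words):
    total = sum(counts.values())
    p = 1.0
    for w in query_words:
        # add 1 to every count, including unseen words
        p *= (counts.get(w, 0) + 1) / (total + V)
    return p

q = ["I", "don't", "want"]
p_a = add1_prob(aerosmith, q)  # (3/29)^3  ~ 1.1e-3
p_l = add1_prob(ladygaga, q)   # (4*2*4)/31^3 ~ 1.1e-3
print(p_a > p_l)  # True: Aerosmith wins
```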
Week 2: Basic (Boolean) IR. Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings. Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
43
Ch 1
Indexer steps: Dictionary & Postings. Multiple term entries in a single document are merged. Split into Dictionary and Postings. Doc frequency information is also stored. Why frequency?
Sec 1.2
44
The merge: walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
(Figure: Brutus → 2, 4, 8, 16, 32, 64, 128; Caesar → 1, 2, 3, 5, 8, 13, 21, 34; the intersection found so far is 2, 8.)
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
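The linear merge can be sketched as follows (illustration only, using the Brutus/Caesar postings from the figure):

```python
# O(x+y) intersection of two docID-sorted postings lists.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```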
Sec 1.3
45
Week 3: Postings lists and choosing terms. Faster merging of postings lists: skip pointers. Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries. Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
46
Ch 2
Adding skip pointers to postings: done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
47
(Figure: postings list 2, 4, 8, 41, 48, 64, 128 with skip pointers to 41 and 128; postings list 1, 2, 3, 8, 11, 17, 21, 31 with skip pointers to 11 and 31.)
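A sketch of intersection with skip pointers (illustration only; the √L skip spacing is the common heuristic, and the lists are the ones from the figure):

```python
# Intersection using skip pointers placed every sqrt(L) entries.
import math

def build_skips(postings):
    # skip[i] = index we may jump to from position i
    step = int(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings) - step, step)}

def intersect_with_skips(p1, p2):
    s1, s2 = build_skips(p1), build_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # follow the skip only while it does not overshoot p2[j]
            if i in s1 and p1[s1[i]] <= p2[j]:
                i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                j = s2[j]
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 41, 48, 64, 128]
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(p1, p2))  # [2, 8]
```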
Sec 2.3
Biword indexes: index every consecutive pair of terms in the text as a phrase – a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen". Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
48
Sec 2.4.1
Positional index example:
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>
For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
49
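The positional merge for a two-word phrase can be sketched as below. Note the toy postings are made-up values for illustration, not the slide's numbers:

```python
# Positional phrase match for "to be": find docs where some position of
# "to" is immediately followed (offset 1) by a position of "be".
def phrase_match(pos1, pos2, offset=1):
    # pos1/pos2: {docID: sorted position list}; returns matching docIDs
    return sorted(d for d in pos1.keys() & pos2.keys()
                  if any(p + offset in set(pos2[d]) for p in pos1[d]))

to_postings = {1: [4, 9, 20], 4: [8, 16, 190], 5: [362, 366]}
be_postings = {1: [7, 18, 33], 4: [17, 191, 430], 5: [363, 367]}
print(phrase_match(to_postings, be_postings))  # [4, 5]
```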
Sec 2.4.2
Choosing terms: definitely language-specific. In English, we worry about tokenization, stop word removal, normalization (e.g., case folding), and lemmatization and stemming.
50
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hashes, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, N-grams redux
2. Spelling correction: edit distance, N-grams re-redux
3. Phonetic – Soundex
51
Hash table: each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree – O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we occasionally need to do the expensive operation of rehashing everything. Not very tolerant!
52
Sec 3.1
Wildcard queries: how can we handle *'s in the middle of a query term, e.g., co*tion? We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
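The Permuterm rotation trick can be sketched as follows (illustration only; "concoction" is an assumed example term, not from the slides):

```python
# Permuterm index idea: index every rotation of term + "$"; a wildcard
# query X*Y is rotated to Y$X*, so the * lands at the end and becomes a
# prefix lookup over the rotations.
def permuterm_rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def rotate_query(query):
    # single-* wildcard X*Y  ->  prefix query on Y$X
    x, y = query.split("*")
    return y + "$" + x

print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(rotate_query("co*tion"))  # 'tion$co'
```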
53
Sec 3.2
Spelling correction. Isolated-word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
54
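The first of these measures, Levenshtein distance, can be sketched with the standard dynamic program:

```python
# Levenshtein (edit) distance: minimum number of insertions, deletions,
# and substitutions to turn string a into string b.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("judgment", "judgement"))  # 1
print(edit_distance("cat", "dog"))             # 3
```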
Sec 3.3.2
Week 5: Index construction.
Sort-based indexing: Blocked Sort-Based Indexing – merge sort is effective for disk-based sorting (avoids seeks).
Single-Pass In-Memory Indexing: no global dictionary – generate a separate dictionary for each block; don't sort postings – accumulate postings as they occur.
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.
55
BSBI: Blocked Sort-Based Indexing (sorting with fewer disk seeks). 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes; these are generated as we parse docs. Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec 4.2
56
SPIMI: Single-Pass In-Memory Indexing.
Key idea 1: generate separate dictionaries for each block – no need to maintain a term–termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
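A minimal sketch of SPIMI-Invert for one block (illustration only; it assumes the token stream for a document is contiguous, so duplicate docIDs are adjacent):

```python
# SPIMI-Invert for a single block: fresh dictionary, unsorted postings
# accumulated as tokens stream by; terms sorted only at block-write time.
from collections import defaultdict

def spimi_invert(token_stream):
    # token_stream: iterable of (term, docID) pairs for one block
    dictionary = defaultdict(list)      # term -> postings list
    for term, doc_id in token_stream:
        postings = dictionary[term]     # term added on first sight
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)     # accumulate as they occur
    # sort terms once, when the block is written out
    return sorted(dictionary.items())

block = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 4)]
print(spimi_invert(block))
# [('brutus', [1, 4]), ('caesar', [1, 2])]
```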
Sec 4.3
57
Distributed indexing: MapReduce data flow.
58
(Figure: the master assigns splits of the input to parsers in the map phase; parsers write term partitions a-b, c-d, …, y-z into segment files; in the reduce phase, inverters each collect one term partition across all segment files and write out its postings.)
Dynamic indexing, 2nd simplest approach: maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge the results. Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs in a search result by this bit-vector. Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
59
Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write merge Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2; … etc. (Loop for log levels.)
Sec 4.5
60
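The cascade above behaves like binary addition with carries, which can be sketched as follows (illustration only; indexes are modeled as sorted lists, and `log_merge_add` is a hypothetical helper name):

```python
# Logarithmic merging: level i holds an on-disk index of size n * 2**i;
# flushing the in-memory index Z0 cascades merges like a binary carry.
def log_merge_add(levels, z0, n):
    # levels: dict level -> index (or None); z0: in-memory index
    if len(z0) < n:
        return levels, z0              # Z0 still fits in memory
    carry, i = sorted(z0), 0
    while levels.get(i):               # merge with I_i while it exists
        carry = sorted(levels[i] + carry)
        levels[i] = None
        i += 1
    levels[i] = carry                  # write as I_i
    return levels, []

levels, z0 = {}, []
for doc in range(8):                   # with n = 2, flush every 2 docs
    z0.append(doc)
    levels, z0 = log_merge_add(levels, z0, n=2)
print({i: v for i, v in levels.items() if v})
# {2: [0, 1, 2, 3, 4, 5, 6, 7]}
```

After 8 insertions everything has cascaded into the single level-2 index, just as 8 = 2³ docs fill one index of size n·2².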
Week 6: Index compression. Collection and vocabulary statistics: Heaps' and Zipf's laws. Compression to make the index smaller (and faster): dictionary compression for Boolean indexes – dictionary-as-a-string, blocks, front coding; postings compression – gap encoding.
61
Index compression: dictionary-as-a-string and blocking. Store pointers to every kth term string; example below with k = 4. Need to store term lengths (1 extra byte each):
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
62
Sec 5.2
CS3245 ndash Information Retrieval
1st Generation of search adsGoto (1996)
Buddy Blake bid the maximum ($038) for this search He paid $038 to Goto every time somebody clicked on the
link Pages were simply ranked according to bid ndash revenue
maximization for Goto No separation of adsdocs Only one result list Upfront and honest No relevance ranking
but Goto did not pretend there was any10
Sec 193
CS3245 ndash Information Retrieval
2nd generation of search ads Google (2000)
11
SogoTrade appearsin searchresults
SogoTrade appearsin ads
Do search enginesrank advertisershigher thannon-advertisers
All major searchengines claim ldquonordquo
Sec 193
CS3245 ndash Information Retrieval
Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major
advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet
Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)
12
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your
ad (ie CPC)
How does the auction determine an adrsquos rank and the price paidfor the ad
Basis is a second price auction but with twists For the bottom line this is perhaps the most important
research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means
billions of additional revenue for the search engine
13
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads
Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =
clicks per impression Result A nonrelevant ad will be ranked low
Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is
maximized if users get useful information
14
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what
percentage of time do users click on it CTR is a measure of relevance
ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank rank in auction paid second price auction price paid by advertiser
15
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction
price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)
price1 = bid2 times CTR2 CTR1
p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050
16
Sec 193
CS3245 ndash Information Retrieval
Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim
17
Sec 193
httpwwwwordstreamcomarticlesmost-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use n-1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.
41
Unigram LM with Add-1 smoothing
Not used in practice, but the most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.
e.g. assuming |V| = 11:
Q2 (By Probability): "I don't want"
P(Aerosmith): 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga): 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith
42
Counts for line 1 (Aerosmith): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
Counts for line 2 (LadyGaga): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)
|V| = total # of word types in both lines
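The scoring above can be sketched in code. This is a toy check, assuming the two count tables from the slide; note that with add-1 smoothing applied, both queries come out near 1.1E-3 and Aerosmith still wins.

```python
from collections import Counter

def add1_prob(token, counts, vocab_size):
    """Add-1 (Laplace) smoothed unigram probability."""
    total = sum(counts.values())
    return (counts[token] + 1) / (total + vocab_size)

def score(query_tokens, counts, vocab_size):
    """Probability of the query under the smoothed unigram LM."""
    p = 1.0
    for tok in query_tokens:
        p *= add1_prob(tok, counts, vocab_size)
    return p

aerosmith = Counter({"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2,
                     "my": 2, "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1})
ladygaga = Counter({"i": 3, "don't": 1, "want": 3, "to": 1, "close": 1,
                    "my": 1, "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2})
V = len(set(aerosmith) | set(ladygaga))  # |V| = 11 word types across both lines

query = ["i", "don't", "want"]
print(score(query, aerosmith, V))  # (3/29)^3, about 1.1e-3
print(score(query, ladygaga, V))   # (4/31)(2/31)(4/31), slightly smaller
```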
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging". Simple optimizations by expected size.
43
Ch 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?
Sec 1.2
44
The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Result: 2 → 8
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
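The linear merge can be sketched directly, using the Brutus/Caesar lists above as input:

```python
def intersect(p1, p2):
    """Linear-time intersection of two sorted postings lists of docIDs."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```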
Sec 1.3
45
Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
46
Ch 2
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
47
List 1: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers 2 ⇒ 41 ⇒ 128
List 2: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers 1 ⇒ 11 ⇒ 31
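A sketch of skip-based intersection. Placing a skip pointer every √(list length) positions is one common heuristic, assumed here; it is not the only choice.

```python
import math

def build_skips(postings):
    """Skip pointer table: from index i we may jump ahead to skips[i].
    Heuristic (an assumption here): one skip every sqrt(len) positions."""
    n = len(postings)
    step = int(math.sqrt(n)) or 1
    return {i: i + step for i in range(0, n - step, step)}

def intersect_with_skips(p1, p2):
    """Intersect sorted postings lists, following skips that don't overshoot."""
    s1, s2 = build_skips(p1), build_skips(p2)
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                while i in s1 and p1[s1[i]] <= p2[j]:
                    i = s1[i]          # take the skip: still <= target
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer

list1 = [2, 4, 8, 41, 48, 64, 128]
list2 = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(list1, list2))  # [2, 8]
```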
Sec 2.3
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
48
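A minimal biword index sketch; the document texts are invented for illustration.

```python
from collections import defaultdict

def build_biword_index(docs):
    """Map each consecutive word pair ("biword") to the set of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for a, b in zip(tokens, tokens[1:]):
            index[f"{a} {b}"].add(doc_id)
    return index

docs = {1: "friends romans countrymen",
        2: "romans countrymen lend me your ears"}
idx = build_biword_index(docs)
print(sorted(idx["romans countrymen"]))  # [1, 2]
print(sorted(idx["friends romans"]))     # [1]
```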
Sec 2.4.1
Positional index example
For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.
49
<be: 993427; 1: <7, 18, 33, 72, 86, 231>; 2: <3, 149>; 4: <17, 191, 291, 430, 434>; 5: <363, 367>; …>
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
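A sketch of two-word positional matching. The "be" postings follow the slide; the positions for "to" below are invented for illustration.

```python
def phrase_docs(pos_index, w1, w2):
    """Docs where w2 occurs at the position immediately after w1,
    given positional postings of the form {term: {doc_id: [positions]}}."""
    hits = []
    for doc in sorted(pos_index.get(w1, {}).keys() & pos_index.get(w2, {}).keys()):
        pos1 = set(pos_index[w1][doc])
        if any(p - 1 in pos1 for p in pos_index[w2][doc]):  # adjacency test
            hits.append(doc)
    return hits

pos_index = {
    "to": {1: [6, 32], 2: [17], 4: [190, 429], 5: [100]},   # hypothetical
    "be": {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
           4: [17, 191, 291, 430, 434], 5: [363, 367]},      # from the slide
}
print(phrase_docs(pos_index, "to", "be"))  # [1, 4]
```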
Sec 2.4.2
Choosing terms
Definitely language specific! In English, we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.
50
Week 4: The dictionary and tolerant retrieval
Data Structures for the Dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General, Trees, Permuterm, Ngrams (redux)
2. Spelling Correction: Edit Distance, Ngrams (re-redux)
3. Phonetic – Soundex
51
Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.
52
Sec 3.1
Not very tolerant
Wildcard queries
How can we handle *'s in the middle of a query term? e.g. co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
53
Sec 3.2
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
54
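The Levenshtein distance can be sketched with the standard dynamic program:

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions, each cost 1."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i            # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j            # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n]

print(edit_distance("judgment", "judgement"))  # 1
print(edit_distance("kitten", "sitting"))      # 3
```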
Sec 3.3.2
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoids seeks).
Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate them as they occur).
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.
55
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes; these are generated as we parse docs.
Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec 4.2
56
SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
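A toy sketch of inverting one SPIMI block (term frequencies, memory limits, and the disk write are omitted; the token stream is assumed to arrive grouped by document):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build one SPIMI block: a per-block dictionary with postings
    accumulated in arrival order, not sorted as they come in."""
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]                     # new terms get a fresh list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                # append docID as it occurs
    # sort terms only when writing the block out, so blocks can be merged later
    return dict(sorted(index.items()))

stream = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("noble", 2), ("brutus", 2)]
block = spimi_invert(stream)
print(block)  # {'brutus': [1, 2], 'caesar': [1, 2], 'noble': [2]}
```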
Sec 4.3
57
Distributed Indexing: MapReduce data flow
58
[Figure: the Master assigns splits (segment files) to Parsers in the map phase; each Parser writes output partitioned by term range (a-b, c-d, …, y-z); in the reduce phase, each Inverter collects one term-range partition and writes its postings.]
Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs; filter the docs in a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
59
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2, and so on (loop for log levels).
Sec 4.5
60
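The bookkeeping above can be sketched by modeling each index as a sorted list of docIDs; levels, names, and the size threshold n are simplifications, not the lecture's exact implementation:

```python
def add_doc(levels, z0, n, doc_id):
    """Logarithmic merge over in-memory Z0 and on-disk levels I0, I1, ...
    levels[i] is None or an index holding about n * 2**i postings."""
    z0.append(doc_id)
    if len(z0) < n:
        return
    carry = sorted(z0)                      # flush Z0
    z0.clear()
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)             # create a new, larger level
        if levels[i] is None:
            levels[i] = carry               # write as I_i
            return
        carry = sorted(levels[i] + carry)   # merge to form Z_{i+1}
        levels[i] = None
        i += 1

levels, z0, n = [], [], 2
for doc_id in [5, 3, 9, 1, 7, 2, 8]:
    add_doc(levels, z0, n, doc_id)
print(levels, z0)  # [[2, 7], [1, 3, 5, 9]] [8]
```

A query would search Z0 plus every non-empty level and merge the results.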
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding. Postings compression: gap encoding.
61
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4. We need to store term lengths (1 extra byte each).
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
We save 9 bytes on 3 pointers, but lose 4 bytes on term lengths.
62
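Decoding one block of the dictionary string can be sketched as follows; for readability, decimal digits stand in for the single length byte used on the slide:

```python
def read_block(s, offset=0, k=4):
    """Decode k length-prefixed terms from a blocked dictionary string,
    starting at the block pointer `offset`."""
    terms, i = [], offset
    for _ in range(k):
        j = i
        while s[j].isdigit():
            j += 1                      # read the (multi-digit) length prefix
        length = int(s[i:j])
        terms.append(s[j:j + length])   # the term itself
        i = j + length
    return terms

block = "7systile9syzygetic8syzygial6syzygy"
print(read_block(block))  # ['systile', 'syzygetic', 'syzygial', 'syzygy']
```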
Sec 5.2
2nd generation of search ads: Google (2000)
11
SogoTrade appears in search results; SogoTrade appears in ads.
Do search engines rank advertisers higher than non-advertisers? All major search engines claim "no".
Sec 19.3
Do ads influence editorial content?
Similar problem at newspapers and TV channels: a newspaper is reluctant to publish harsh criticism of its major advertisers. The line often gets blurred at newspapers and on TV. No known case of this happening with search engines yet.
Leads to the jobs of white hat and black hat search engine optimization (organic) and search engine marketing (paid).
12
Sec 19.3
How are ads ranked?
Advertisers bid for keywords – sale by auction. Open system: anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on their ad (i.e., cost per click, CPC).
How does the auction determine an ad's rank and the price paid for the ad? The basis is a second price auction, but with twists.
For the bottom line, this is perhaps the most important research area for search engines – computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.
13
Sec 19.3
How are ads ranked?
First cut: according to bid price – à la Goto. Bad idea: open to abuse. Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
Instead: rank based on bid price and relevance. The key measure of ad relevance is the click-through rate: CTR = clicks per impression. Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term.
Hope: overall acceptance of the system and overall revenue are maximized if users get useful information.
14
Sec 19.3
Google's second price auction
bid: maximum bid for a click by the advertiser.
CTR: click-through rate; when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
ad rank: bid × CTR; this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is.
rank: rank in the auction.
paid: second price auction price paid by the advertiser.
15
Sec 19.3
Google's second price auction
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2), so price1 = bid2 × CTR2 / CTR1.
p1 = bid2 × CTR2/CTR1 = 3.00 × 0.03/0.06 = 1.50
p2 = bid3 × CTR3/CTR2 = 1.00 × 0.08/0.03 = 2.67
p3 = bid4 × CTR4/CTR3 = 4.00 × 0.01/0.08 = 0.50
16
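The auction arithmetic can be reproduced with a short sketch. The four bids and CTRs below are the advertisers implied by the slide's numbers (an assumption); the "plus 1 cent" is omitted for simplicity.

```python
def second_price_auction(bids, ctrs):
    """Rank ads by ad rank = bid * CTR; each pays the next ad's
    bid * CTR divided by its own CTR (simplified generalized second price)."""
    order = sorted(range(len(bids)), key=lambda a: bids[a] * ctrs[a], reverse=True)
    prices = {}
    for pos, a in enumerate(order):
        if pos + 1 < len(order):
            nxt = order[pos + 1]
            prices[a] = bids[nxt] * ctrs[nxt] / ctrs[a]
        else:
            prices[a] = 0.0          # lowest-ranked ad has no one below it
    return order, prices

bids = [4.00, 3.00, 2.00, 1.00]      # advertisers A, B, C, D
ctrs = [0.01, 0.03, 0.06, 0.08]
order, prices = second_price_auction(bids, ctrs)
print(order)                          # [2, 1, 3, 0]: ad ranks 0.12, 0.09, 0.08, 0.04
print(round(prices[2], 2))            # 1.5  (= 3.00 * 0.03 / 0.06)
```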
Sec 19.3
Keywords with high bids
Top 10 Most Expensive Google Keywords: Insurance, Loans, Mortgage, Attorney, Credit, Lawyer, Donate, Degree, Hosting, Claim
17
Sec 19.3
http://www.wordstream.com/articles/most-expensive-keywords
Search ads: a win-win-win
The search engine company gets revenue every time somebody clicks on an ad.
The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads; as a result, users are often satisfied with what they find after clicking on an ad.
The advertiser finds new customers in a cost-effective way.
18
Sec 19.3
Not a win-win-win: keyword arbitrage
Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google (e.g., redirect to a page full of ads). This rarely makes sense for the user.
(Ad) spammers keep inventing new tricks, and the search engines need time to catch up with them: Adversarial Information Retrieval.
19
Sec 19.3
Not a win-win-win: violation of trademarks
Example: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.
From https://support.google.com/adwords/policy/answer/6118?rd=1: it's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site.
20
Sec 19.3
DUPLICATE DETECTION
21
Sec 19.6
Duplicate detection
The web is full of duplicated content, more so than many other collections.
Exact duplicates are easy to detect: use a hash/fingerprint (e.g., MD5).
Near-duplicates are more common on the web and difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents. Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
22
Sec 19.6
Near-duplicates: Example
23
Sec 19.6
Detecting near-duplicates
Compute similarity with an edit-distance measure. We want "syntactic" (as opposed to semantic) similarity: true semantic similarity (similarity in content) is too difficult to compute. We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is/isn't a near-duplicate", e.g., two documents are near-duplicates if their similarity > θ = 80%.
24
Sec 19.6
Recall: Jaccard coefficient
A commonly used measure of the overlap of two sets. Let A and B be two sets; the Jaccard coefficient is J(A, B) = |A ∩ B| / |A ∪ B|.
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. It always assigns a number between 0 and 1.
25
Sec 19.6
Jaccard coefficient: Example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.
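The example can be verified directly:

```python
def shingles(text, n=2):
    """Set of word n-grams (shingles) for a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2))  # 0.375
print(jaccard(d1, d3))  # 0.0
```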
26
Sec 19.6
A document as a set of shingles
A shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents. For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: {a-rose-is, rose-is-a, is-a-rose}.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
27
Sec 19.6
Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting. We use sk to refer to the shingle's fingerprint in [1..2^m].
This doesn't directly help us – we are just converting strings to large integers. But it will help us compute an approximation to the actual Jaccard coefficient quickly.
28
Sec 19.6
Documents as sketches
The number of shingles per document is large, making documents difficult to compare exhaustively. To make it fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1 … π200, where each πi is a random permutation on [1..2^m].
The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers).
29
Sec 19.6
30
Deriving a sketch element: a permutation of the original hashes
[Figure: the shingle hashes of document 1 and document 2 under a permutation π; the minimum permuted value in each document becomes the sketch element.]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for whether d1 and d2 are near-duplicates. In this case, permutation π says d1 ≈ d2.
Sec 19.6
Proof that J(S(di), S(dj)) ≅ P(xi = xj)
We view a matrix A with one column per set of hashes; element Aij = 1 if element i is present in set Sj. Permutation π(n) is a random reordering of the rows of A. xk is the first non-zero entry in π(dk), i.e., the first shingle present in document k.
Let C00 = # of rows in A where both entries are 0; define C11, C10, and C01 likewise.
J(Si, Sj) is then equivalent to C11 / (C10 + C01 + C11). P(xi = xj) is also equivalent to C11 / (C10 + C01 + C11).
31
[Figure: an example matrix A with columns Si and Sj (J(Si, Sj) = 2/5), and permutations π(1), π(2), π(3), …, π(n=200), each marked as xi = xj or xi ≠ xj.]
Sec 19.6
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
Our sketch is based on a random selection of permutations. Thus, to compute the Jaccard coefficient, count the number k of successful permutations for <d1, d2> and divide by n = 200: k/n = k/200 estimates J(d1, d2).
32
Sec 19.6
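An end-to-end toy version of the estimator: a small fingerprint space (2^12 instead of the slides' 2^64), permutations modeled as shuffled lookup tables, and two synthetic "documents" with forced overlap. These simplifications are assumptions for the demo, not the production scheme.

```python
import random

def sketch(doc_hashes, perms):
    """MinHash sketch: one minimum permuted value per permutation."""
    return [min(p[s] for s in doc_hashes) for p in perms]

def estimate_jaccard(sk1, sk2):
    """Proportion of successful permutations (equal sketch elements)."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

random.seed(42)
m = 2 ** 12                 # toy fingerprint space
n = 200                     # number of permutations, as on the slide
perms = []
for _ in range(n):
    p = list(range(m))
    random.shuffle(p)       # model each pi_i as a shuffled lookup table
    perms.append(p)

d1 = set(random.sample(range(m), 100))
d2 = set(random.sample(range(m), 100)) | set(list(d1)[:50])  # force overlap
true_j = len(d1 & d2) / len(d1 | d2)
est_j = estimate_jaccard(sketch(d1, perms), sketch(d2, perms))
print(round(true_j, 3), round(est_j, 3))  # the estimate tracks the true Jaccard
```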
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Do ads influence editorial content Similar problem at newspapers TV channels A newspaper is reluctant to publish harsh criticism of its major
advertisers The line often gets blurred at newspapers on TV No known case of this happening with search engines yet
Leads to the job of white and black hat search engine optimization (organic) and search engine marketing(paid)
12
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked Advertisers bid for keywords ndash sale by auction Open system Anybody can participate and bid on keywords Advertisers are only charged when somebody clicks on your
ad (ie CPC)
How does the auction determine an adrsquos rank and the price paidfor the ad
Basis is a second price auction but with twists For the bottom line this is perhaps the most important
research area for search engines ndash computational advertising Squeezing an additional fraction of a cent from each ad means
billions of additional revenue for the search engine
13
Sec 193
CS3245 ndash Information Retrieval
How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads
Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =
clicks per impression Result A nonrelevant ad will be ranked low
Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is
maximized if users get useful information
14
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what
percentage of time do users click on it CTR is a measure of relevance
ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank rank in auction paid second price auction price paid by advertiser
15
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction
price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)
price1 = bid2 times CTR2 CTR1
p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050
16
Sec 193
CS3245 ndash Information Retrieval
Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim
17
Sec 193
httpwwwwordstreamcomarticlesmost-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall the Jaccard coefficient: a commonly used measure of the overlap of two sets. Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|.
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. It always assigns a number between 0 and 1.
Sec. 19.6
Jaccard coefficient: Example. Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.
Sec. 19.6
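The worked example above can be reproduced directly. A minimal sketch (the helper names `shingles` and `jaccard` are mine):

```python
def shingles(text, n=2):
    # The word n-grams ("shingles") of a document, as a set.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; defined as 0.0 here when both sets are empty.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
d3 = "Jack traveled from Oakland to London"

print(jaccard(shingles(d1), shingles(d2)))  # 0.375
print(jaccard(shingles(d1), shingles(d3)))  # 0.0
```

d1 contributes 4 bigrams and d2 contributes 7; they share 3, so 3 / (4 + 7 - 3) = 3/8, matching the slide.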
A document as a set of shingles: A shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents.
For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
Sec. 19.6
Fingerprinting: We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting. We use sk to refer to the shingle's fingerprint in [1..2^m].
This doesn't directly help us; we are just converting strings to large integers.
But it will help us compute an approximation to the actual Jaccard coefficient quickly.
Sec. 19.6
Documents as sketches: The number of shingles per document is large, so documents are difficult to compare exhaustively.
To make it fast, we use a sketch, a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1, ..., π200. Each πi is a random permutation on [1..2^m].
The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), ..., min_{s∈d} π200(s) > (a vector of 200 numbers).
Sec. 19.6
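The sketch construction above can be implemented compactly. One assumption, flagged up front: instead of materializing true random permutations of [1..2^m], the code uses random affine hash functions h(x) = (a·x + b) mod M, a common implementation substitute; all names here are mine, not the course's.

```python
import hashlib
import random

M = (1 << 61) - 1  # a large prime; stands in for the fingerprint space

def fingerprint(shingle):
    # Map a shingle (a string) to a large integer via MD5, truncated.
    return int(hashlib.md5(shingle.encode()).hexdigest()[:15], 16)

def shingle_set(text, n=2):
    # Fingerprints of the word n-grams of a document.
    words = text.lower().split()
    return {fingerprint(" ".join(words[i:i + n]))
            for i in range(len(words) - n + 1)}

def make_perms(count=200, seed=0):
    # Random affine hashes approximate the random permutations π1..π200.
    rng = random.Random(seed)
    return [(rng.randrange(1, M), rng.randrange(M)) for _ in range(count)]

def sketch(fps, perms):
    # One minimum per "permutation": the sketch vector of the document.
    return [min((a * x + b) % M for x in fps) for a, b in perms]

def estimate_jaccard(s1, s2):
    # Proportion of successful permutations (matching minima).
    return sum(u == v for u, v in zip(s1, s2)) / len(s1)

perms = make_perms()
d1 = shingle_set("Jack London traveled to Oakland")
d2 = shingle_set("Jack London traveled to the city of Oakland")
est = estimate_jaccard(sketch(d1, perms), sketch(d2, perms))
# est approximates J(d1, d2) = 0.375, up to sampling noise from 200 trials
```

Identical shingle sets always produce identical sketches, so the estimate for a document against itself is exactly 1.0.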
Deriving a sketch element: a permutation of the original hashes.
(Figure: the same permutation π applied to the hashes of document 1 and document 2, producing one sketch element for each.)
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.
Sec. 19.6
Proof that J(S(di), S(dj)) ≅ P(xi = xj):
We view a matrix A with one column per set of hashes; element Aij = 1 if element i is present in set Sj. Each permutation π(k) is a random reordering of the rows of A, and xk is the first non-zero entry in π(dk), i.e., the first shingle present in document k.
Let C00 = # of rows in A where both entries are 0; define C11, C10, and C01 likewise.
J(si, sj) is then equivalent to C11 / (C10 + C01 + C11).
P(xi = xj) is then also equivalent to C11 / (C10 + C01 + C11): under a random permutation, the first row that is non-zero in either column is equally likely to be any of the C11 + C10 + C01 such rows, and xi = xj exactly when that row is a C11 row.
(Figure: an example 0/1 matrix A over columns Si and Sj with J(si, sj) = 2/5, and permutations π(1), π(2), π(3), ..., π(n=200), some yielding xi = xj and some yielding xi ≠ xj.)
Sec. 19.6
Estimating Jaccard: Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
Our sketch is based on a random selection of permutations.
Thus, to estimate the Jaccard coefficient, count the number k of successful permutations for < d1, d2 > and divide by n = 200:
k/n = k/200 estimates J(d1, d2).
Sec. 19.6
Shingling: Summary
Input: N documents.
Choose the n-gram size for shingling, e.g., n = 5.
Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities.
Take the transitive closure of documents with similarity > θ.
Index only one document from each equivalence class.
Sec. 19.6
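The transitive-closure step in the summary above is naturally implemented with union-find: merge every pair whose similarity exceeds θ, then read off the equivalence classes. A sketch under the assumption that the above-threshold pairs have already been computed (names are mine):

```python
def near_duplicate_classes(n_docs, similar_pairs):
    # Union-find over document IDs 0..n_docs-1; each pair in
    # similar_pairs had similarity > θ, so its endpoints are merged.
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Group documents by their root: the equivalence classes.
    classes = {}
    for d in range(n_docs):
        classes.setdefault(find(d), []).append(d)
    return list(classes.values())

# Docs 0-1 and 1-2 are near-duplicate pairs, so 0,1,2 form one class.
print(near_duplicate_classes(5, [(0, 1), (1, 2), (3, 4)]))
# [[0, 1, 2], [3, 4]]
```

Indexing then keeps one representative per class, as the slide prescribes.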
Summary:
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec system by 20 May 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics: Anything covered in lecture (through the slides) and the corresponding sections in the textbook.
For sample questions, refer to the tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.
Exam Topic Distribution: The focus is on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!
REVISION
Week 1: Ngram Models. A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use n-1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.
Unigram LM with Add-1 smoothing: not used in practice, but the most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.
E.g., assuming |V| = 11 (the total # of word types in both lines):
Aerosmith (counts after add-1): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1. Total count: 18.
LadyGaga (counts after add-1): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2. Total count: 20.
Q2 (By Probability): "I don't want"
P(Aerosmith) = 2/18 × 2/18 × 2/18 ≈ 0.11 × 0.11 × 0.11 = 1.3E-3
P(LadyGaga) = 3/20 × 1/20 × 3/20 = 0.15 × 0.05 × 0.15 = 1.1E-3
Winner: Aerosmith
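The Q2 calculation above can be checked mechanically. A minimal sketch, assuming the two lines are "I don't want to close my eyes" (Aerosmith) and "I want your love and I want your revenge" (Lady Gaga); function names are mine:

```python
def add_one_lm(tokens, vocab):
    # Unigram LM with add-1 (Laplace) smoothing over vocabulary `vocab`:
    # every type starts at count 1, so unseen words get nonzero mass.
    counts = {w: 1 for w in vocab}
    for t in tokens:
        counts[t] += 1
    total = len(tokens) + len(vocab)
    return {w: c / total for w, c in counts.items()}

def seq_prob(lm, seq):
    # Probability of a token sequence under a unigram LM.
    p = 1.0
    for w in seq:
        p *= lm[w]
    return p

vocab = {"i", "don't", "want", "to", "close", "my", "eyes",
         "your", "love", "and", "revenge"}  # |V| = 11
aero = add_one_lm("i don't want to close my eyes".split(), vocab)
gaga = add_one_lm("i want your love and i want your revenge".split(), vocab)

query = "i don't want".split()
# Aerosmith: (2/18)^3 ≈ 1.3E-3; Lady Gaga: 3/20 × 1/20 × 3/20 ≈ 1.1E-3
print(seq_prob(aero, query) > seq_prob(gaga, query))  # True
```

The smoothed totals (7 tokens + 11 types = 18, and 9 + 11 = 20) match the tables above.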
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
Ch. 1
Indexer steps: Dictionary & Postings. Multiple term entries in a single document are merged. Split into dictionary and postings. Document frequency information is also stored. Why frequency?
Sec. 1.2
The merge: Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Brutus AND Caesar: 2 → 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
Sec. 1.3
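The linear-time merge above is the textbook two-pointer intersection. A sketch of it on the Brutus/Caesar lists:

```python
def intersect(p1, p2):
    # Two-pointer merge of sorted postings lists: advance the pointer
    # with the smaller docID; emit a docID when both pointers agree.
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, giving the O(x+y) bound from the slide.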
Week 3: Postings lists and choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch. 2
Adding skip pointers to postings: done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
(Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.)
Sec. 2.3
Biword indexes: Index every consecutive pair of terms in the text as a phrase: a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
Sec. 2.4.1
Positional index example:
< be: 993427; 1: < 7, 18, 33, 72, 86, 231 >; 2: < 3, 149 >; 4: < 17, 191, 291, 430, 434 >; 5: < 363, 367 >; ... >
For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.
Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec. 2.4.2
Choosing terms: definitely language specific. In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash tables, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, N-grams redux
2. Spelling correction: edit distance, N-grams re-redux
3. Phonetic: Soundex
Hash table: Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: not very tolerant. No easy way to find minor variants (judgment/judgement); no prefix search. If the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.
Sec. 3.1
Wildcard queries: How can we handle *'s in the middle of a query term, e.g., co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: a k-gram index with postfiltering.
Sec. 3.2
Spelling correction: Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
Sec. 3.3.2
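Alternative 1 above, plain Levenshtein distance, is a standard dynamic program. A sketch with unit costs (the rolling one-row layout is an implementation choice, not from the slides):

```python
def edit_distance(s, t):
    # Levenshtein distance: minimum number of insertions, deletions,
    # and substitutions turning s into t, via dynamic programming.
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distance from "" to each prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (s[i - 1] != t[j - 1])
            curr[j] = min(prev[j] + 1,      # delete s[i-1]
                          curr[j - 1] + 1,  # insert t[j-1]
                          sub)              # substitute (or match)
        prev = curr
    return prev[n]

print(edit_distance("judgment", "judgement"))  # 1 (one insertion)
print(edit_distance("cat", "dog"))             # 3
```

This is exactly the measure that makes "judgment" and "judgement" (the hash-table slide's minor-variant example) one edit apart.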
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing (BSBI). Merge sort is effective for disk-based sorting (it avoids seeks).
Single-Pass In-Memory Indexing (SPIMI): no global dictionary; generate a separate dictionary for each block. Don't sort postings; accumulate postings as they occur.
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked Sort-Based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These records are generated as we parse docs.
Must now sort 100M 8-byte records by termID. Define a block as ~10M such records; we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk. Then merge the blocks into one long sorted order.
Sec. 4.2
SPIMI: Single-Pass In-Memory Indexing
Key idea 1: generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas, we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
Sec. 4.3
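The two SPIMI key ideas can be seen in a toy, in-memory sketch of inverting one block: no global term-termID mapping, and postings appended unsorted as they stream by (real SPIMI also writes the block to disk, which is omitted here; names are mine).

```python
from collections import defaultdict

def spimi_invert(token_stream):
    # One SPIMI block. token_stream yields (term, docID) pairs; docIDs
    # for a given term are assumed to arrive in nondecreasing order.
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]          # a new term gets a fresh list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)     # append the docID as it occurs
    # Terms are sorted only once, when the block is written out.
    return dict(sorted(index.items()))

stream = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 4)]
print(spimi_invert(stream))
# {'brutus': [1, 4], 'caesar': [1, 2]}
```

Merging several such block dictionaries (same term, concatenate postings) then yields the final index.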
Distributed Indexing: MapReduce data flow
(Figure: segment files feed Parsers, assigned by the Master, in the Map phase; each parser partitions its (term, docID) output into term ranges a-b, c-d, ..., y-z; in the Reduce phase, one Inverter per term range collects that range and writes the postings.)
Dynamic indexing, 2nd simplest approach: Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output for a search result by this bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec. 4.5
Logarithmic merge: Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2. Loop for log levels.
Sec. 4.5
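The cascade of merges above can be modeled in a toy form, with each index as a sorted list of postings and level i holding up to n·2^i of them; this is an illustrative sketch of the control flow, not a real on-disk implementation, and all names are mine.

```python
def log_merge_insert(memory, disk, posting, n=2):
    # memory models Z0; disk[i] models I_i (None = level i is empty).
    memory.append(posting)
    if len(memory) < n:
        return
    z = sorted(memory)  # Z0 is full: flush it down the levels
    memory.clear()
    level = 0
    while True:
        if level == len(disk):
            disk.append(None)
        if disk[level] is None:
            disk[level] = z          # write Z_i to disk as I_i
            return
        z = sorted(z + disk[level])  # merge with I_i to form Z_{i+1}
        disk[level] = None
        level += 1

memory, disk = [], []
for doc_id in range(1, 8):
    log_merge_insert(memory, disk, doc_id, n=2)
print(memory, disk)  # [7] [[5, 6], [1, 2, 3, 4]]
```

After 7 insertions with n = 2, one posting sits in memory, two at level 0, and four at level 1: each level is twice the size of the previous one, as the slide describes.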
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster:
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
Postings compression: gap encoding.
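The gap-encoding idea in the last bullet is simple enough to sketch: store each docID as its difference from the previous one, so that dense postings lists become lists of small (and hence cheaply encodable) numbers. Helper names here are mine.

```python
def to_gaps(postings):
    # Replace each docID (sorted) with the gap from its predecessor.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    # Cumulative sum restores the original docIDs.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(to_gaps([283042, 283043, 283044, 283057]))  # [283042, 1, 1, 13]
```

The small gaps (1, 1, 13) can then be stored with a variable-length code instead of fixed-width integers.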
Index compression: Dictionary-as-a-String and blocking. Store pointers to every kth term string; example below with k = 4. Need to store term lengths (1 extra byte each).
...7systile9syzygetic8syzygial6syzygy11szaibelyite...
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec. 5.2
How are ads ranked? Advertisers bid for keywords: sale by auction. Open system: anybody can participate and bid on keywords. Advertisers are only charged when somebody clicks on their ad (i.e., CPC, cost per click).
How does the auction determine an ad's rank and the price paid for the ad? The basis is a second price auction, but with twists.
For the bottom line, this is perhaps the most important research area for search engines: computational advertising. Squeezing an additional fraction of a cent from each ad means billions of additional revenue for the search engine.
Sec. 19.3
How are ads ranked? First cut: according to bid price, a la Goto. Bad idea: open to abuse. Example: query [husband cheating] → ad for divorce lawyer. We don't want to show nonrelevant ads.
Instead: rank based on bid price and relevance. Key measure of ad relevance: click-through rate (CTR) = clicks per impression. Result: a nonrelevant ad will be ranked low, even if this decreases search engine revenue short-term.
Hope: overall acceptance of the system, and overall revenue, is maximized if users get useful information.
Sec. 19.3
Google's second price auction
bid: maximum bid for a click by the advertiser.
CTR: click-through rate: when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance.
ad rank: bid × CTR. This trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is.
rank: rank in the auction.
paid: second price auction price paid by the advertiser.
Sec. 19.3
Google's second price auction (continued)
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent); related to the Vickrey auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2), so price1 = bid2 × CTR2 / CTR1.
p1 = bid2 × CTR2 / CTR1 = 3.00 × 0.03 / 0.06 = 1.50
p2 = bid3 × CTR3 / CTR2 = 1.00 × 0.08 / 0.03 = 2.67
p3 = bid4 × CTR4 / CTR3 = 4.00 × 0.01 / 0.08 = 0.50
Sec. 19.3
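The pricing rule above is easy to mechanize. The sketch below uses bids and CTRs reconstructed to be consistent with the worked numbers (bid2 = 3.00, CTR1 = 0.06, etc.); the advertiser names A-D and the convention that the last-ranked ad pays 0 are my assumptions, not from the slides.

```python
def second_price_auction(ads):
    # ads: list of (name, bid, ctr). Rank by ad rank = bid * CTR;
    # each ad pays the minimum needed to keep its position:
    #   price_i = bid_{i+1} * CTR_{i+1} / CTR_i
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    prices = []
    for i, (name, bid, ctr) in enumerate(ranked):
        if i + 1 < len(ranked):
            _, nbid, nctr = ranked[i + 1]
            prices.append((name, round(nbid * nctr / ctr, 2)))
        else:
            prices.append((name, 0.0))  # no ad below: pays nothing here
    return prices

ads = [("A", 4.00, 0.01), ("B", 3.00, 0.03),
       ("C", 2.00, 0.06), ("D", 1.00, 0.08)]
print(second_price_auction(ads))
# [('C', 1.5), ('B', 2.67), ('D', 0.5), ('A', 0.0)]
```

Note that C wins despite not having the highest bid: its higher CTR gives it the top ad rank, and it pays only 1.50, matching p1 above.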
Keywords with high bids: Top 10 most expensive Google keywords: insurance, loans, mortgage, attorney, credit, lawyer, donate, degree, hosting, claim.
http://www.wordstream.com/articles/most-expensive-keywords
Sec. 19.3
Search ads: a win-win-win
The search engine company gets revenue every time somebody clicks on an ad.
The user only clicks on an ad if they are interested in the ad. Search engines punish misleading and nonrelevant ads. As a result, users are often satisfied with what they find after clicking on an ad.
The advertiser finds new customers in a cost-effective way.
Sec. 19.3
Not a win-win-win: Keyword arbitrage. Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google; e.g., redirect to a page full of ads.
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
How are ads ranked First cut according to bid price ndash a la Goto Bad idea open to abuse Example query [husband cheating] rarr ad for divorce lawyer We donrsquot want to show nonrelevant ads
Instead rank based on bid price and relevance Key measure of ad relevance click-through rate = CTR =
clicks per impression Result A nonrelevant ad will be ranked low
Even if this decreases search engine revenue short-term Hope Overall acceptance of the system and overall revenue is
maximized if users get useful information
14
Sec 193
Google's second price auction
bid: maximum bid for a click by advertiser
CTR: click-through rate; when an ad is displayed, what percentage of the time do users click on it? CTR is a measure of relevance
ad rank: bid × CTR; this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank: rank in auction
paid: second price auction, price paid by advertiser
15
Sec 19.3
Google's second price auction
Second price auction: the advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) – related to the Vickrey auction
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)
price1 = bid2 × CTR2 / CTR1
p1 = bid2 × CTR2/CTR1 = $3.00 × 0.03/0.06 = $1.50
p2 = bid3 × CTR3/CTR2 = $1.00 × 0.08/0.03 = $2.67
p3 = bid4 × CTR4/CTR3 = $4.00 × 0.01/0.08 = $0.50
16
Sec 19.3
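The worked example above can be checked with a small sketch. The rank-1 advertiser's own bid (here $5.00) is a made-up placeholder; it never enters any price.

```python
def price_at_rank(i, bids, ctrs):
    # Second-price rule: advertiser at rank i pays the minimum needed to
    # keep rank i, i.e. price_i = bid_{i+1} * CTR_{i+1} / CTR_i.
    return bids[i + 1] * ctrs[i + 1] / ctrs[i]

bids = [5.00, 3.00, 1.00, 4.00]  # bid 1 ($5.00) is an assumed placeholder
ctrs = [0.06, 0.03, 0.08, 0.01]
prices = [round(price_at_rank(i, bids, ctrs), 2) for i in range(3)]
print(prices)  # [1.5, 2.67, 0.5]
```

Note that advertiser 2 pays more than advertiser 1 here: price depends on the next advertiser's bid × CTR relative to your own CTR, not on your rank alone.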
Keywords with high bids
Top 10 most expensive Google keywords: Insurance, Loans, Mortgage, Attorney, Credit, Lawyer, Donate, Degree, Hosting, Claim
17
Sec 19.3
http://www.wordstream.com/articles/most-expensive-keywords
Search ads: a win-win-win
The search engine company gets revenue every time somebody clicks on an ad
The user only clicks on an ad if they are interested in the ad
Search engines punish misleading and nonrelevant ads
As a result, users are often satisfied with what they find after clicking on an ad
The advertiser finds new customers in a cost-effective way
18
Sec 19.3
Not a win-win-win: keyword arbitrage
Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google
E.g., redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks; the search engines need time to catch up with them
Adversarial Information Retrieval
19
Sec 19.3
Not a win-win-win: violation of trademarks
E.g., geico: during part of 2005, the search term "geico" on Google was bought by competitors
Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe
http://support.google.com/adwordspolicy/answer/6118?rd=1
"It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
20
Sec 19.3
DUPLICATE DETECTION
21
Sec 19.6
Duplicate detection
The web is full of duplicated content, more so than many other collections
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5)
Near-duplicates: more common on the web, difficult to eliminate
For the user, it's annoying to get a search result with near-identical documents
Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 19.6
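For the exact-duplicate case, the hash/fingerprint idea is a one-liner; a minimal sketch using MD5 from Python's standard library:

```python
import hashlib

def fingerprint(doc: str) -> str:
    # Exact duplicates share a fingerprint; any change produces a new one.
    return hashlib.md5(doc.encode("utf-8")).hexdigest()

a = "To be, or not to be, that is the question."
b = "To be, or not to be, that is the question."
c = "To be, or not to be -- that is the question."
print(fingerprint(a) == fingerprint(b))  # True: exact duplicate detected
print(fingerprint(a) == fingerprint(c))  # False: a near-duplicate slips through
```

The second comparison failing is exactly why near-duplicates need the shingle and sketch machinery that follows.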
Near-duplicates: Example
23
Sec 19.6
Detecting near-duplicates
Compute similarity with an edit-distance measure
We want "syntactic" (as opposed to semantic) similarity; true semantic similarity (similarity in content) is too difficult to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use a similarity threshold θ to make the call "is/isn't a near-duplicate"
E.g., two documents are near-duplicates if similarity > θ = 80%
24
Sec 19.6
Recall: Jaccard coefficient
A commonly used measure of overlap of two sets. Let A and B be two sets; the Jaccard coefficient is
J(A, B) = |A ∩ B| / |A ∪ B|
JACCARD(A, A) = 1
JACCARD(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size
Always assigns a number between 0 and 1
25
Sec 19.6
Jaccard coefficient: Example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375
J(d1, d3) = 0
Note: very sensitive to dissimilarity
26
Sec 19.6
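The example can be reproduced with a few lines; `shingles` and `jaccard` below are direct implementations of the definitions above:

```python
def shingles(text, n=2):
    # Set of word n-grams (here bigrams) after lowercasing.
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    # |A intersect B| / |A union B|
    return len(a & b) / len(a | b)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
d3 = "Jack traveled from Oakland to London"
print(jaccard(shingles(d1), shingles(d2)))  # 0.375 (= 3/8)
print(jaccard(shingles(d1), shingles(d3)))  # 0.0
```

d1 and d3 contain exactly the same words, yet share no bigram: this is the "very sensitive to dissimilarity" point above.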
A document as a set of shingles
A shingle is simply a word n-gram
Shingles are used as features to measure syntactic similarity of documents
For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 19.6
Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting
We use s_k to refer to the shingle's fingerprint in 1..2^m
This doesn't directly help us – we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 19.6
Documents as sketches
The number of shingles per document is large, difficult to exhaustively compare
To make it fast, we use a sketch: a subset of the shingles of a document
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1, …, π200
Each πi is a random permutation on 1..2^m
The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers)
29
Sec 19.6
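A runnable sketch of the sketching idea, with a toy fingerprint universe of 1,000 values standing in for 2^m and explicit random permutations (all sizes and the two toy shingle sets are illustrative):

```python
import random

def make_permutations(n_perms, universe_size, seed=0):
    rng = random.Random(seed)
    perms = []
    for _ in range(n_perms):
        p = list(range(universe_size))
        rng.shuffle(p)          # p[f] is the permuted value of fingerprint f
        perms.append(p)
    return perms

def sketch(fingerprints, perms):
    # One sketch element per permutation: the minimum permuted fingerprint.
    return [min(p[f] for f in fingerprints) for p in perms]

perms = make_permutations(200, 1000)   # n = 200 permutations
s1 = set(range(0, 60))                 # toy shingle sets with known overlap:
s2 = set(range(30, 90))                # true Jaccard = 30/90 = 1/3
sk1, sk2 = sketch(s1, perms), sketch(s2, perms)
est = sum(a == b for a, b in zip(sk1, sk2)) / len(perms)
print(abs(est - 1/3) < 0.2)
```

The fraction of matching sketch positions estimates the Jaccard coefficient; the proof on the next slides shows why.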
30
Deriving a sketch element: a permutation of the original hashes
[Figure: the hashes of document 1 and document 2 under a shared permutation π; each sketch element is the minimum permuted hash of a document]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2
Sec 19.6
Proof that J(S(di), S(dj)) ≅ P(xi = xj)
We view a matrix A with one column per set of hashes; element Aij = 1 if element i in set Sj is present
Permutation π(n) is a random reordering of the rows in A
xk is the first non-zero entry in π(dk), i.e., the first shingle present in document k
Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise
J(si, sj) is then equivalent to C11 / (C10 + C01 + C11)
P(xi = xj) then is equivalent to C11 / (C10 + C01 + C11)
[Figure: an example matrix A over sets Si and Sj with J(si, sj) = 2/5, and permutations π(1), π(2), π(3), …, π(n=200) giving xi ≠ xj, xi ≠ xj, xi = xj, xi = xj]
31
Sec 19.6
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of the probability of success: proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200
k/n = k/200 estimates J(d1, d2)
32
Sec 19.6
Shingling: Summary
Input: N documents
Choose an n-gram size for shingling, e.g., n = 5
Pick 200 random permutations, represented as hash functions
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document
Compute pairwise similarities
Take the transitive closure of documents with similarity > θ
Index only one document from each equivalence class
33
Sec 19.6
Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials
34
EXAM MATTERS
35

36
Subscribe via NUS EduRec System by 20 May 2019
Get exam results earlier via SMS on 3 Jun 2019
For more information:
https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
37
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through slides) and corresponding sections in the textbook
For sample questions, refer to tutorials, the midterm, and past year exam papers
If in doubt, ask on the forum
38
Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12
Q1 is true/false questions on various topics (20 marks)
Q2–Q9 are calculation/essay questions, topic specific (8~12 marks each)
Don't forget your calculator!
39
40
REVISION
Week 1: Ngram Models
A language model is created based on a collection of text, and used to assign a probability to a sequence of words
Unigram LM: bag of words
Ngram LM: use n-1 tokens as the context to predict the nth token
Larger n-gram models are more accurate, but each increase in order requires exponentially more space
LMs can be used to do information retrieval, as introduced in Week 11
41
Unigram LM with add-1 smoothing
Not used in practice, but most basic to understand
Idea: add 1 count to all entries in the LM, including those that are not seen
e.g., assuming |V| = 11

Q2 (by probability): "I don't want"
P(Aerosmith): .11 × .11 × .11 = 1.3E-3
P(LadyGaga): .15 × .05 × .15 = 1.1E-3
Winner: Aerosmith
42

Counts, line 1 (total count 18): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1
Counts, line 2 (total count 20): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2
|V| = 11: total # of word types in both lines
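The Q2 computation can be reproduced from the count tables. The slide's probabilities (.11 = 2/18, .15 = 3/20) identify the first table as Aerosmith and the second as Lady Gaga; no smoothing is needed for this query since every query word occurs in both tables.

```python
aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}  # total 18
ladygaga = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20

def query_prob(query, counts):
    # Unigram LM: the query probability is the product of per-word probabilities.
    total = sum(counts.values())
    p = 1.0
    for w in query.split():
        p *= counts[w] / total
    return p

pa = query_prob("I don't want", aerosmith)  # (2/18)^3, about 1.3E-3
pl = query_prob("I don't want", ladygaga)   # (3/20)(1/20)(3/20), about 1.1E-3
print("Aerosmith" if pa > pl else "LadyGaga")  # Aerosmith
```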
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings
Key characteristic: sorted order for postings
Boolean query processing
Intersection by linear-time "merging"
Simple optimizations by expected size
43
Ch 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged
Split into Dictionary and Postings
Doc frequency information is also stored. Why frequency?
Sec 1.2
44
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Brutus AND Caesar: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations
Crucial: postings must be sorted by docID
Sec 1.3
45
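The merge above as code, a direct transcription of the linear-time intersection:

```python
def intersect(p1, p2):
    # Linear-time merge of two sorted postings lists: O(x + y).
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1          # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```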
Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers
Handling of phrase and proximity queries
Biword indexes for phrase queries
Positional indexes for phrase/proximity queries
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming
46
Ch 2
Adding skip pointers to postings
Done at indexing time
Why? To skip postings that will not figure in the search results
How to do it? And where do we place skip pointers?
[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31]
47
Sec 2.3
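One possible encoding of the skip-enabled merge, a sketch where a list with skip length s has pointers at indices 0, s, 2s, … (the figure's exact placement differs; this just shows the jumping logic):

```python
def intersect_with_skips(p1, p2, s1, s2):
    # Sorted postings intersection; a pointer sitting at a multiple of its
    # skip length may jump ahead while the skip target doesn't overshoot
    # the other list's current docID.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1          # follow the skip pointer
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 41, 48, 64, 128]    # skips 2 -> 41 -> 128 with s1 = 3
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(p1, p2, 3, 4))  # [2, 8]
```

Skipping is safe because a jump is taken only when the skip target is still ≤ the other list's current docID, so no match can be stepped over.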
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words
For example, the text "Friends, Romans, Countrymen" would generate the biwords: friends romans, romans countrymen
Each of these biwords is now a dictionary term
Two-word phrase query processing is now immediate
48
Sec 2.4.1
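Generating the biwords from the slide's example text takes one comprehension:

```python
def biwords(text):
    # Strip commas, lowercase, then pair consecutive tokens.
    toks = text.lower().replace(",", "").split()
    return [f"{toks[i]} {toks[i + 1]}" for i in range(len(toks) - 1)]

print(biwords("Friends, Romans, Countrymen"))
# ['friends romans', 'romans countrymen']
```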
Positional index example
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>
For phrase queries, we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
49
Sec 2.4.2
Choosing terms
Definitely language specific
In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming
50
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash, trees
Learning to be tolerant:
1. Wildcards: general, trees, Permuterm, Ngrams redux
2. Spelling correction: edit distance, Ngrams re-redux
3. Phonetic – Soundex
51
Hash Table
Each vocabulary term is hashed to an integer
Pros: lookup is faster than for a tree, O(1)
Cons (not very tolerant):
No easy way to find minor variants: judgment/judgement
No prefix search
If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
52
Sec 3.1
Wildcard queries
How can we handle *'s in the middle of a query term? co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end
This gives rise to the Permuterm index
Alternative: k-gram index with postfiltering
53
Sec 3.2
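The permuterm transformation as a sketch (the vocabulary and query are toy examples; a real index would store every rotation in a B-tree for fast prefix lookup rather than scanning):

```python
def rotations(term):
    # All rotations of term + '$' (end-of-word marker): the permuterm keys.
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_match(query, vocab):
    # Query X*Y  ->  look for terms having a rotation with prefix Y$X,
    # so the wildcard now sits at the end of the key.
    x, y = query.split("*")
    key = y + "$" + x
    return [t for t in vocab if any(r.startswith(key) for r in rotations(t))]

print(wildcard_match("co*tion", ["concentration", "nation", "cotton"]))
# ['concentration']
```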
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. ngram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits
54
Sec 3.3.2
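Alternative 1 above, Levenshtein distance, in its standard dynamic-programming form:

```python
def levenshtein(a, b):
    # DP over prefix lengths, keeping one row at a time; O(|a|*|b|) time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```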
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks!)
Single-Pass In-Memory Indexing
No global dictionary – generate a separate dictionary for each block
Don't sort postings – accumulate postings as they occur
Distributed indexing using MapReduce
Dynamic indexing: multiple indices, logarithmic merge
55
BSBI: Blocked sort-based indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID)
As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes
These are generated as we parse docs
Must now sort 100M 8-byte records by termID
Define a block as ~10M such records; can easily fit a couple into memory; will have 10 such blocks for our collection
Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order
Sec 4.2
56
SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks
Key idea 2: don't sort; accumulate postings in postings lists as they occur
With these two ideas, we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 4.3
57
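A minimal sketch of SPIMI-Invert for one in-memory block (real SPIMI also checks available memory and writes the finished block to disk; both omitted here):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    # One block: a fresh dictionary, postings accumulated as they occur.
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]                 # new terms get an empty list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)            # docIDs arrive in order
    # Sort terms only once, when the block is written out.
    return {t: index[t] for t in sorted(index)}

stream = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4)]
print(spimi_invert(stream))  # {'brutus': [1, 4], 'caesar': [1, 2]}
```

Note the two key ideas in the code: the dictionary is local to the block, and no per-token sorting happens, only one sort of the terms at the end.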
Distributed Indexing: MapReduce data flow
[Figure: the Master assigns splits of the input to Parsers (map phase), which write (term, docID) pairs into segment files partitioned by term range (a-b, c-d, …, y-z); Inverters (reduce phase) each collect one partition and produce its postings]
58
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one
Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2
… etc. (loop for log levels)
Sec 4.5
60
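The Z0/I0/I1 scheme can be simulated in a few lines, a sketch with n = 2 documents per in-memory index and indexes represented as sorted docID lists:

```python
def log_merge_add(doc, z0, disk, n):
    # z0: in-memory index Z0; disk[i]: on-disk index I_i (size ~ n * 2^i).
    z0.append(doc)
    if len(z0) > n:                       # Z0 too big: flush down the hierarchy
        z, i = sorted(z0), 0
        z0.clear()
        while i in disk:                  # merge with I_i as long as it exists
            z = sorted(disk.pop(i) + z)   # Z_{i+1} = I_i merged with Z_i
            i += 1
        disk[i] = z                       # first free level: write as I_i

z0, disk = [], {}
for d in range(1, 10):
    log_merge_add(d, z0, disk, n=2)
print(sorted(disk))  # [0, 1]  (occupied disk levels after 9 docs)
```

Each posting moves up the hierarchy by repeated doubling merges, which is where the "merged up to log T times" bound comes from.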
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws
Compression to make the index smaller, faster
Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
Postings compression: gap encoding
61
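Gap encoding in two small functions, showing the encode/decode round trip (a variable-length code would then be applied to the small gaps):

```python
def to_gaps(postings):
    # Store the first docID, then successive differences; frequent terms
    # have small gaps, which compress well with variable-length codes.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    # Decode by running-sum: cumulative totals restore the docIDs.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

p = [283042, 283043, 283044, 283045]
g = to_gaps(p)
print(g)  # [283042, 1, 1, 1]
assert from_gaps(g) == p
```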
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4
Need to store term lengths (1 extra byte)
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths
62
Sec 5.2
CS3245 ndash Information Retrieval
Googlersquos second price auction
bid maximum bid for a click by advertiser CTR click-through rate when an ad is displayed what
percentage of time do users click on it CTR is a measure of relevance
ad rank bid times CTR this trades off (i) how much money the advertiser is willing to pay against (ii) how relevant the ad is
rank rank in auction paid second price auction price paid by advertiser
15
Sec 193
CS3245 ndash Information Retrieval
Googlersquos second price auction
Second price auction The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) ndash related to the Vickrey Auction
price1 times CTR1 = bid2 times CTR2 (this will result in rank1=rank2)
price1 = bid2 times CTR2 CTR1
p1 = bid2 times CTR2CTR1 = 300 times 003006 = 150 p2 = bid3 times CTR3CTR2 = 100 times 008003 = 267 p3 = bid4 times CTR4CTR3 = 400 times 001008 = 050
16
Sec 193
CS3245 ndash Information Retrieval
Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim
17
Sec 193
httpwwwwordstreamcomarticlesmost-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge: Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merged Z1 to disk as I1 (if no I1), or merge with I1 to form Z2.
… etc.
Sec 45
60
Loop for log levels
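The cascade described above can be sketched as a single flush step. This is an illustrative model (indexes as sorted lists, threshold check as "reaches n"), not the book's pseudocode; the function name is my own:

```python
def logarithmic_insert(indexes, z0, n):
    """One flush step of logarithmic merge. indexes maps level -> on-disk
    index I_level (modeled as a sorted list); z0 is the in-memory index.
    Once z0 reaches n postings, cascade: merge with I0, I1, ... until a
    free level is found, so I_level holds up to 2^level * n postings."""
    if len(z0) < n:
        return indexes, z0            # still fits in memory
    z, level = sorted(z0), 0
    while level in indexes:           # I_level exists: merge and carry upward
        z = sorted(indexes.pop(level) + z)
        level += 1
    indexes[level] = z                # write Z as I_level
    return indexes, []                # start a fresh in-memory index
```

Flushing [1, 2] into an empty structure creates I0; flushing [3, 4] next merges both into I1, leaving I0 free again.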
CS3245 ndash Information Retrieval
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws
Compression to make the index smaller, faster
Dictionary compression for Boolean indexes: Dictionary string, blocks, front coding
Postings compression Gap encoding
61
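Gap encoding, listed above under postings compression, stores the difference between successive docIDs rather than the docIDs themselves, so frequent terms get many small gaps that compress well. A minimal sketch (function names are my own):

```python
def to_gaps(doc_ids):
    """Postings compression: keep the first docID, then store gaps between
    successive docIDs (small gaps -> short variable-length codes)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Decode: running sum of the gaps recovers the docIDs."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out
```

For example, `to_gaps([283042, 283043, 283154, 283159])` gives `[283042, 1, 111, 5]`.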
CS3245 ndash Information Retrieval
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4.
Need to store term lengths (1 extra byte)
62
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec 52
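The packing above can be reproduced with a short sketch. This is illustrative only: the length prefix is written as decimal digits for readability, whereas a real index would use a fixed one-byte length, and the function name is my own:

```python
def pack_dictionary(terms, k=4):
    """Dictionary-as-a-string with blocking: sorted terms are concatenated,
    each prefixed by its length; a term pointer is kept only for every k-th
    term (within a block, walk the length prefixes to find the others).
    Length prefixes are decimal digits here; a real index uses 1 byte."""
    pieces, pointers, pos = [], [], 0
    for i, t in enumerate(sorted(terms)):
        if i % k == 0:
            pointers.append(pos)       # string offset of each block start
        entry = str(len(t)) + t
        pieces.append(entry)
        pos += len(entry)
    return "".join(pieces), pointers
```

Run on the slide's five terms, it reproduces the example string with block pointers at offsets 0 and 34.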
CS3245 ndash Information Retrieval
Google's second price auction
Second price auction: The advertiser pays the minimum amount necessary to maintain their position in the auction (plus 1 cent) - related to the Vickrey auction.
price1 × CTR1 = bid2 × CTR2 (this will result in rank1 = rank2)
price1 = bid2 × CTR2 / CTR1
p1 = bid2 × CTR2/CTR1 = $3.00 × 0.03/0.06 = $1.50
p2 = bid3 × CTR3/CTR2 = $1.00 × 0.08/0.03 = $2.67
p3 = bid4 × CTR4/CTR3 = $4.00 × 0.01/0.08 = $0.50
16
Sec 193
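The pricing rule above is easy to check mechanically. A minimal sketch using the slide's bids and CTRs; the top bid ($5.00) is a hypothetical filler, since it does not affect any computed price, and the last slot is priced 0 here for lack of a competitor below it:

```python
def gsp_prices(bids, ctrs):
    """Generalized second-price pricing as on the slide: the advertiser in
    slot i pays bid[i+1] * ctr[i+1] / ctr[i], i.e., just enough to keep its
    rank; the last slot has no competitor below it and is priced 0 here."""
    prices = []
    for i in range(len(bids)):
        if i + 1 < len(bids):
            prices.append(round(bids[i + 1] * ctrs[i + 1] / ctrs[i], 2))
        else:
            prices.append(0.0)
    return prices
```

With bids [5.00, 3.00, 1.00, 4.00] and CTRs [0.06, 0.03, 0.08, 0.01], this reproduces the slide's $1.50, $2.67, $0.50.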
CS3245 ndash Information Retrieval
Keywords with high bids: Top 10 Most Expensive Google Keywords: Insurance, Loans, Mortgage, Attorney, Credit, Lawyer, Donate, Degree, Hosting, Claim
17
Sec 193
http://www.wordstream.com/articles/most-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win: Violation of trademarks: geico. During part of 2005, the search term "geico" on Google was bought by competitors. Geico lost this case in the United States. Louis Vuitton lost a similar case in Europe.
http://support.google.com/adwordspolicy/answer/6118?rd=1 - "It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATE DETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detection: The web is full of duplicated content - more so than many other collections.
Exact duplicates: Easy to detect; use a hash/fingerprint (e.g., MD5).
Near-duplicates: More common on the web, difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents.
Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates: Compute similarity with an edit-distance measure.
We want "syntactic" (as opposed to semantic) similarity. True semantic similarity (similarity in content) is too difficult to compute.
We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is/isn't a near-duplicate".
E.g., two documents are near-duplicates if similarity > θ = 80%.
24
Sec 196
CS3245 ndash Information Retrieval
Recall: Jaccard coefficient. A commonly used measure of overlap of two sets. Let A and B be two sets; the Jaccard coefficient is J(A, B) = |A ∩ B| / |A ∪ B|.
J(A, A) = 1; J(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1.
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient: Example. Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.
26
Sec 196
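The example above can be checked directly. A minimal sketch (function names are my own) that shingles the three documents and computes the Jaccard coefficients:

```python
def shingles(text, n=2):
    """The set of word n-grams (shingles) of a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    """J(A, B) = len(A & B) / len(A | B)"""
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
```

d1 and d2 share 3 bigrams out of 8 in their union, so J(d1, d2) = 0.375; d1 and d3 share none, so J(d1, d3) = 0.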
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents. For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }.
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting: We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting.
We use s_k to refer to the shingle's fingerprint in [1..2^m].
This doesn't directly help us - we are just converting strings to large integers.
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
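The sketch construction above can be approximated in code. A common practical stand-in, shown here as an illustrative sketch with my own function names, replaces the random permutations πi with random universal-hash functions h(x) = (a·x + b) mod p:

```python
import random

def make_hashes(n=200, seed=42):
    """n random universal-hash functions h(x) = (a*x + b) mod p, standing in
    for random permutations on the fingerprint space. p must exceed the
    fingerprint universe; 2^61 - 1 (a Mersenne prime) is used here."""
    p = (1 << 61) - 1
    rng = random.Random(seed)
    return [(rng.randrange(1, p), rng.randrange(p), p) for _ in range(n)]

def sketch(fingerprints, hashes):
    """Sketch of a document: for each h_i, the minimum h_i(s) over the
    document's shingle fingerprints s."""
    return [min((a * s + b) % p for s in fingerprints) for a, b, p in hashes]

def estimate_jaccard(sk1, sk2):
    """Proportion of 'successful' positions (equal minima) estimates J."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)
```

Two documents with true Jaccard 3/5 should yield an estimate close to 0.6; identical documents always yield 1.0.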
CS3245 ndash Information Retrieval
30
[Figure: deriving a sketch element - the same permutation π is applied to the hashes of document 1 and document 2, and the minimum of each is taken]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?". In this case, permutation π says d1 ≈ d2.
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11). P(x_i = x_j) is then also equivalent to C11 / (C10 + C01 + C11).
31
[Figure: an example 0/1 matrix A with columns S_i and S_j, and random permutations π(1), π(2), π(3), …, π(n=200) of its rows; in the example J(S_i, S_j) = 2/5, and each permutation yields either x_i ≠ x_j or x_i = x_j]
We view a matrix A: 1 column per set of hashes. Element A_ij = 1 if element i in set S_j is present.
Permutation π(n) is a random reordering of the rows in A. x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard: Thus, the proportion of successful permutations is the Jaccard coefficient.
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations.
Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200.
k/n = k/200 estimates J(d1, d2)
32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities. Take the transitive closure of documents with similarity > θ. Index only one document from each equivalence class.
33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is true/false questions on various topics (20 marks)
Q2-Q9 are calculation/essay questions, topic specific (8~12 marks each). Don't forget your calculator!
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
e.g., assuming |V| = 11:
Q2 (By Probability): "I don't want"
P(Aerosmith) = .11 × .11 × .11 = 1.3E-3
P(LadyGaga) = .15 × .05 × .15 = 1.1E-3
Winner: Aerosmith
42
Aerosmith: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
Lady Gaga: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)
|V| = 11 = total # of word types in both lines
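The add-1 calculation can be sketched as follows. This is an illustrative scorer with my own function name, using the first table's counts; note the slide's rounded per-word probabilities may differ slightly from the exact fractions:

```python
def add_one_prob(query_tokens, counts, vocab_size):
    """Unigram LM with add-1 smoothing:
    P(w) = (count(w) + 1) / (total + |V|); a query scores as the product."""
    total = sum(counts.values())
    p = 1.0
    for w in query_tokens:
        p *= (counts.get(w, 0) + 1) / (total + vocab_size)
    return p

# Counts from the first table above (total count 18), |V| = 11
counts = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
          "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
```

Each of "i", "don't", "want" then gets probability (2+1)/(18+11) = 3/29, and an unseen word gets 1/29.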
CS3245 ndash Information Retrieval
Week 2: Basic (Boolean) IR
Basic inverted indexes: In-memory dictionary and on-disk postings
Key characteristic: Sorted order for postings
Boolean query processing: Intersection by linear-time "merging"; simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequency information is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
[Figure: merging the postings lists Brutus: 2, 4, 8, 16, 32, 64, 128 and Caesar: 1, 2, 3, 5, 8, 13, 21, 34 - the intersection is 2, 8]
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
Sec 13
45
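The linear-time merge above is the classic two-pointer intersection. A minimal sketch (function name is my own):

```python
def intersect(p1, p2):
    """Linear-time merge of two sorted postings lists: O(x + y).
    Advance the pointer with the smaller docID; on a match, emit it."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer
```

On the Brutus and Caesar lists from the figure, this returns [2, 8].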
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?
47
[Figure: postings list 2, 4, 8, 41, 48, 64, 128 with skip pointers to 41 and 128; postings list 1, 2, 3, 8, 11, 17, 21, 31 with skip pointers to 11 and 31]
Sec 23
CS3245 ndash Information Retrieval
Biword indexes: Index every consecutive pair of terms in the text as a phrase - a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 242
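The "more than just equality" step above checks positions, not just docIDs. A toy adjacency check, not the textbook's full positional-intersect algorithm (which generalizes to /k proximity); the function name and data are my own:

```python
def phrase_in_doc(pos1, pos2):
    """Given the sorted position lists of term1 and term2 within one
    document, test whether term2 ever occurs at a position exactly one
    past term1, i.e., whether the 2-word phrase occurs in this document."""
    following = set(pos2)
    return any(p + 1 in following for p in pos1)
```

For example, with term1 at positions [7, 18, 33] and term2 at [8, 40], the phrase occurs (7 is followed by 8).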
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic - Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash Table: Each vocabulary term is hashed to an integer.
Pros: Lookup is faster than for a tree: O(1)
Cons: No easy way to find minor variants: judgment/judgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries: How can we handle *'s in the middle of a query term? co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end.
This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
53
Sec 32
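The permuterm rotation can be sketched in two small functions (names are my own). A term T matches X*Y exactly when some rotation of T$ begins with Y$X:

```python
def permuterm_keys(term):
    """All rotations of term + '$'; each rotation is added to the
    dictionary and points back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_to_key(query):
    """Rewrite a single-* query X*Y so the * ends up at the end:
    rotate X*Y$ to Y$X*."""
    head, tail = query.split("*")
    return tail + "$" + head + "*"
```

For example, co*tion becomes the lookup key tion$co*, and "coalition" has a rotation beginning with tion$co, so it matches.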
CS3245 ndash Information Retrieval
Spelling correction: Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time and suggest the one with most hits.
54
Sec 332
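Edit distance, the first alternative above, is the standard dynamic program over prefixes. A minimal sketch (function name is my own) using a rolling row:

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard DP: prev[j] holds the
    distance between the current prefix of a and the j-char prefix of b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[len(b)]
```

The classic example: edit_distance("kitten", "sitting") is 3 (two substitutions and one insertion).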
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Keywords with high bids Top 10 Most Expensive Google Keywords Insurance Loans Mortgage Attorney Credit Lawyer Donate Degree Hosting Claim
17
Sec 193
httpwwwwordstreamcomarticlesmost-expensive-keywords
CS3245 ndash Information Retrieval
Search ads A win-win-win The search engine company gets revenue every time
somebody clicks on an ad The user only clicks on an ad if they are interested in
the ad Search engines punish misleading and nonrelevant
ads As a result users are often satisfied with what they
find after clicking on an ad The advertiser finds new customers in a cost-
effective way
18
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index.
Search across both, merge results.
Deletions: keep an invalidation bit-vector for deleted docs; filter docs in a search result by this invalidation bit-vector.
Periodically re-index into one main index.
Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one.
Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2.
… etc.
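The cascade above can be sketched with plain lists (my own simplification: each "index" is just a sorted docID list, and `disk[i]` plays the role of I_i with capacity about n·2^i):

```python
def add_doc(doc_id, z0, disk, n):
    """Logarithmic merge step: append to in-memory Z0; once |Z0| > n,
    cascade-merge into on-disk levels while each level is occupied."""
    z0 = z0 + [doc_id]
    if len(z0) <= n:
        return z0, disk
    merged, i = z0, 0
    while i < len(disk) and disk[i]:      # I_i occupied: merge and carry
        merged = sorted(disk[i] + merged)
        disk[i] = []
        i += 1
    if i == len(disk):
        disk.append([])
    disk[i] = merged                      # first empty level receives it
    return [], disk

z0, disk = [], []
for d in range(1, 7):                     # index six docs with n = 2
    z0, disk = add_doc(d, z0, disk, n=2)
print(z0, disk)  # [] [[], [1, 2, 3, 4, 5, 6]]
```

The carry behaves like binary addition, which is why each posting participates in at most log T merges.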
[Figure annotation: loop for log levels]
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws
Compression to make the index smaller and faster
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding
Postings compression: gap encoding
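Gap encoding itself fits in two small functions (a sketch with made-up docIDs; real indexes would follow this with variable-length byte or bit codes, since the gaps are small):

```python
def to_gaps(postings):
    """Encode a sorted docID list as its first docID plus successive gaps."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Decode by running sum: each gap is added to the previous docID."""
    out = []
    for g in gaps:
        out.append(g if not out else out[-1] + g)
    return out

p = [283042, 283154, 283159, 283202]
print(to_gaps(p))  # [283042, 112, 5, 43]
```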
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string; example below, k = 4.
Need to store term lengths (1 extra byte):
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
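The blocked layout can be reproduced directly (a sketch; the function name is mine, and the length prefix is written as decimal digits only so the output matches the slide's example string):

```python
def block_dictionary(terms, k=4):
    """Concatenate terms into one string with a length prefix before
    each term, keeping a pointer only to every k-th term (blocking)."""
    s, pointers = "", []
    for i, t in enumerate(terms):
        if i % k == 0:
            pointers.append(len(s))   # offset of the block's first term
        s += str(len(t)) + t
    return s, pointers

terms = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite"]
s, ptrs = block_dictionary(terms)
print(s)     # 7systile9syzygetic8syzygial6syzygy11szaibelyite
print(ptrs)  # [0, 34]
```

With k = 4 we keep 2 pointers instead of 5: within each full block, 3 pointers are saved (9 bytes at 3 bytes each, per the slide) at the cost of 4 length bytes.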
Sec 5.2
Search ads: A win-win-win
The search engine company gets revenue every time somebody clicks on an ad.
The user only clicks on an ad if they are interested in the ad; search engines punish misleading and nonrelevant ads.
As a result, users are often satisfied with what they find after clicking on an ad.
The advertiser finds new customers in a cost-effective way.
Sec 19.3
Not a win-win-win: Keyword arbitrage
Buy a keyword on Google, then redirect traffic to a third party that is paying much more than you are paying Google.
E.g., redirect to a page full of ads.
This rarely makes sense for the user.
(Ad) spammers keep inventing new tricks; the search engines need time to catch up with them: Adversarial Information Retrieval.
Not a win-win-win: Violation of trademarks
E.g., geico: during part of 2005, the search term "geico" on Google was bought by competitors.
Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe.
https://support.google.com/adwordspolicy/answer/6118?rd=1
"It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site."
DUPLICATE DETECTION
Sec 19.6
Duplicate detection
The web is full of duplicated content, more so than many other collections.
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5).
Near-duplicates: more common on the web, difficult to eliminate.
For the user, it's annoying to get a search result with near-identical documents.
Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate.
Near-duplicates: Example
Detecting near-duplicates
Compute similarity with an edit-distance measure.
We want "syntactic" (as opposed to semantic) similarity; true semantic similarity (similarity in content) is too difficult to compute.
We do not consider documents near-duplicates if they have the same content but express it with different words.
Use a similarity threshold θ to make the call "is/isn't a near-duplicate".
E.g., two documents are near-duplicates if similarity > θ = 80%.
Recall: Jaccard coefficient
A commonly used measure of the overlap of two sets. Let A and B be two sets; the Jaccard coefficient is J(A, B) = |A ∩ B| / |A ∪ B|.
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅.
A and B don't have to be the same size.
Always assigns a number between 0 and 1.
Jaccard coefficient: Example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0.
Note: very sensitive to dissimilarity.
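The example can be checked in a few lines (function names are mine):

```python
def shingles(text, n=2):
    """The set of word n-grams (shingles) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two sets."""
    return len(a & b) / len(a | b)

d1 = "Jack London traveled to Oakland"
d2 = "Jack London traveled to the city of Oakland"
d3 = "Jack traveled from Oakland to London"
print(jaccard(shingles(d1), shingles(d2)))  # 0.375
print(jaccard(shingles(d1), shingles(d3)))  # 0.0
```

d1 and d3 share all their words but no bigram, which is the "very sensitive to dissimilarity" point.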
A document as a set of shingles
A shingle is simply a word n-gram.
Shingles are used as features to measure syntactic similarity of documents.
For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }
We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting.
We use s_k to refer to the shingle's fingerprint in 1..2^m.
This doesn't directly help us – we are just converting strings to large integers.
But it will help us compute an approximation to the actual Jaccard coefficient quickly.
Documents as sketches
The number of shingles per document is large: difficult to exhaustively compare.
To make it fast, we use a sketch: a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and is defined by a set of permutations π1 … π200.
Each πi is a random permutation on 1..2^m.
The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers).
Deriving a sketch element: a permutation of the original hashes
[Figure: the shingle hashes of document 1 and document 2 under permutation π; each document's sketch element is its minimum.]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.
Proof that J(S(di), S(dj)) ≅ P(xi = xj)
We view a matrix A with one column per set of hashes; element A_ij = 1 if element i is present in set S_j. Permutation π(n) is a random reordering of the rows in A, and x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.
Let C00 = number of rows in A where both entries are 0; define C11, C10, C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11).
P(x_i = x_j) is then equivalent to C11 / (C10 + C01 + C11).
[Figure: an example matrix A over columns S_i and S_j with J(S_i, S_j) = 2/5; under sample permutations π(1), π(2), π(3), …, π(n=200), the test comes out x_i ≠ x_j or x_i = x_j accordingly.]
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient.
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200).
Our sketch is based on a random selection of permutations.
Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200.
k/n = k/200 estimates J(d1, d2).
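A runnable sketch of the whole estimator (my own simplification: instead of true random permutations on 1..2^m, it uses random linear hashes (a·s + b) mod p, a standard approximately min-wise substitute; the toy fingerprints and names are mine):

```python
import random

def sketch(shingle_ids, perms, p):
    """One sketch element per 'permutation': the minimum of a random
    linear hash (a*s + b) mod p over the document's shingle fingerprints."""
    return [min((a * s + b) % p for s in shingle_ids) for a, b in perms]

def estimate_jaccard(sk1, sk2):
    """Fraction of successful permutations (matching sketch elements)."""
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

P = (1 << 61) - 1                       # a Mersenne prime, standing in for 2^m
random.seed(0)                          # deterministic demo
perms = [(random.randrange(1, P), random.randrange(P)) for _ in range(200)]

d1 = {10, 20, 30, 40}                   # toy shingle fingerprints
d2 = {10, 20, 30, 50, 60}               # true J(d1, d2) = 3/6 = 0.5
est = estimate_jaccard(sketch(d1, perms, P), sketch(d2, perms, P))
# est should land near 0.5 (a 200-trial Bernoulli estimate)
```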
Shingling: Summary
Input: N documents.
Choose an n-gram size for shingling, e.g., n = 5.
Pick 200 random permutations, represented as hash functions.
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities.
Take the transitive closure of documents with similarity > θ.
Index only one document from each equivalence class.
Summary
Search Advertising: a type of crowdsourcing – ask advertisers how much they want to spend, ask searchers how relevant an ad is; auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Subscribe via the NUS EduRec System by 20 May 2019.
Get exam results earlier via SMS on 3 Jun 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through the slides) and the corresponding sections in the textbook.
For sample questions, refer to tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.
Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2-Q9 are calculation/essay questions, topic specific (8~12 marks each); don't forget your calculator!
REVISION
Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words.
Ngram LM: use n-1 tokens as the context to predict the nth token.
Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.
Unigram LM with Add-1 smoothing
Not used in practice, but the most basic to understand.
Idea: add 1 count to all entries in the LM, including those that are not seen.
E.g., assuming |V| = 11:
Q2 (by probability): "I don't want"
P(Aerosmith): .11 × .11 × .11 = 1.3E-3
P(LadyGaga): .15 × .05 × .15 = 1.1E-3
Winner: Aerosmith
Aerosmith line: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (total count: 18)
LadyGaga line: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (total count: 20)
|V| = total number of word types in both lines
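The arithmetic can be reproduced directly. Note the slide's winning-artist numbers use the raw per-line relative frequencies (.11 = 2/18); the add-1 smoothed version, which matters once a query word is unseen, is computed alongside (function and variable names are mine):

```python
def add_one_unigram(counts, vocab_size):
    """Add-1 smoothed unigram probabilities: (c + 1) / (N + |V|)."""
    total = sum(counts.values())
    return {w: (c + 1) / (total + vocab_size) for w, c in counts.items()}

# counts from the slide's Aerosmith table (total = 18, |V| = 11)
aerosmith = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
query = ["i", "don't", "want"]

raw = 1.0
for w in query:                       # raw MLE, as on the slide: (2/18)^3
    raw *= aerosmith[w] / 18

lm = add_one_unigram(aerosmith, 11)   # smoothed: (3/29)^3
smoothed = 1.0
for w in query:
    smoothed *= lm[w]
# raw ≈ 1.37e-3 (the slide's 1.3E-3, truncated); smoothed ≈ 1.1e-3
```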
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings.
Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
Ch 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Doc frequency information is also stored. Why frequency?
Sec 1.2
The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
[Figure: Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128; Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34; result: 2 → 8]
If the list lengths are x and y, the merge takes O(x+y) operations.
Crucial: postings must be sorted by docID.
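The linear merge is short enough to state in full (a sketch; the function name is mine):

```python
def intersect(p1, p2):
    """Linear-time merge of two docID-sorted postings lists: O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1               # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```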
Sec 1.3
Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch 2
Adding skip pointers to postings
Done at indexing time.
Why? To skip postings that will not figure in the search results.
How to do it? And where do we place skip pointers?
[Figure: 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31]
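One common placement is √L evenly spaced skips for a list of length L; a sketch of the merge that follows a skip whenever the skip target does not overshoot the other list (function name mine, arrays stand in for linked lists with skip pointers):

```python
import math

def intersect_with_skips(p1, p2):
    """Postings intersection over docID-sorted lists, taking
    sqrt(len)-spaced skips when the landing docID still fits."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    ans, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            ans.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1          # follow skip pointers on p1
            if p1[i] < p2[j]:
                i += 1
        else:
            while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2          # follow skip pointers on p2
            if p2[j] < p1[i]:
                j += 1
    return ans

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```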
Sec 2.3
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate.
Sec 2.4.1
Positional index example
<be: 993427; 1: <7, 18, 33, 72, 86, 231>; 2: <3, 149>; 4: <17, 191, 291, 430, 434>; 5: <363, 367>; …>
For phrase queries, we use a merge algorithm recursively, at the document level.
Now need to deal with more than just equality.
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 2.4.2
Choosing terms
Definitely language specific!
In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash, trees.
Learning to be tolerant:
1. Wildcards: general trees, permuterm, ngrams redux
2. Spelling correction: edit distance, ngrams re-redux
3. Phonetic: Soundex
Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, you need to occasionally do the expensive operation of rehashing everything.
Not very tolerant!
Sec 3.1
Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion.
We could look up co* AND *tion in a B-tree and intersect the two term sets: expensive.
The solution: transform wildcard queries so that the * always occurs at the end.
This gives rise to the Permuterm index.
Alternative: k-gram index with postfiltering.
Sec 3.2
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
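The Levenshtein distance from option 1 above, as the usual two-row dynamic program (a sketch; the function name is mine):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(|a|*|b|) time,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("judgment", "judgement"))  # 1  (one insertion)
print(edit_distance("cats", "fast"))           # 3
```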
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Not a win-win-win Keyword arbitrage Buy a keyword on Google Then redirect traffic to a third party that is paying
much more than you are paying Google Eg redirect to a page full of ads
This rarely makes sense for the user
(Ad) spammers keep inventing new tricks The search engines need time to catch up with them Adversarial Information Retrieval
19
Sec 193
CS3245 ndash Information Retrieval
Not a win-win-win Violation of trademarksgeico During part of 2005 The search term ldquogeicordquo on Google was bought by competitors Geico lost this case in the United States Louis Vuitton lost a similar case in Europe
httpsupportgooglecomadwordspolicyanswer6118rd=1Itrsquos potentially misleading to users to trigger an ad off of a trademark if the user canrsquot buy the product on the site
20
Sec 193
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string; example below: k = 4
Need to store term lengths (1 extra byte)
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths
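A toy sketch of the packing (the length prefix is written as a decimal character here for readability; a real dictionary stores it in a single byte):

```python
def pack_block(terms):
    """Concatenate a block of k terms, each prefixed by its length,
    so only one pointer into the long string is needed per block."""
    return "".join(f"{len(t)}{t}" for t in terms)

block = ["systile", "syzygetic", "syzygial", "syzygy"]   # one block, k = 4
print(pack_block(block))  # 7systile9syzygetic8syzygial6syzygy
```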
Sec 5.2
Not a win-win-win: Violation of trademarks
geico: During part of 2005, the search term "geico" on Google was bought by competitors
Geico lost this case in the United States; Louis Vuitton lost a similar case in Europe
http://support.google.com/adwordspolicy/answer/6118?rd=1
It's potentially misleading to users to trigger an ad off of a trademark if the user can't buy the product on the site
Sec 19.3
DUPLICATE DETECTION
Sec 19.6
Duplicate detection
The web is full of duplicated content, more so than many other collections
Exact duplicates: easy to detect; use a hash/fingerprint (e.g., MD5)
Near-duplicates: more common on the web, difficult to eliminate
For the user, it's annoying to get a search result with near-identical documents
Marginal relevance is zero: even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
Sec 19.6
Near-duplicates: Example
Sec 19.6
Detecting near-duplicates
Compute similarity with an edit-distance measure
We want "syntactic" (as opposed to semantic) similarity; true semantic similarity (similarity in content) is too difficult to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use a similarity threshold θ to make the call "is/isn't a near-duplicate"
E.g., two documents are near-duplicates if similarity > θ = 80%
Sec 19.6
Recall: Jaccard coefficient
A commonly used measure of overlap of two sets
Let A and B be two sets; Jaccard coefficient:
J(A, B) = |A ∩ B| / |A ∪ B|
JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size
Always assigns a number between 0 and 1
Sec 19.6
Jaccard coefficient: Example
Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0
Note: very sensitive to dissimilarity
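The example can be checked directly with a small sketch (whitespace tokenization assumed):

```python
def shingles(text, n=2):
    """Set of word n-grams (shingles) of a text."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2))  # 0.375  (= 3/8: 3 shared bigrams out of 8 total)
print(jaccard(d1, d3))  # 0.0    (no shared bigram despite shared vocabulary)
```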
Sec 19.6
A document as a set of shingles
A shingle is simply a word n-gram
Shingles are used as features to measure syntactic similarity of documents
For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
Sec 19.6
Fingerprinting
We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting
We use s_k to refer to the shingle's fingerprint in [1..2^m]
This doesn't directly help us – we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
Sec 19.6
Documents as sketches
The number of shingles per document is large; difficult to exhaustively compare
To make it fast, we use a sketch: a subset of the shingles of a document
The size of a sketch is, say, n = 200, and is defined by a set of permutations π1 … π200
Each πi is a random permutation on [1..2^m]
The sketch of d is defined as: < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers)
Sec 19.6
Deriving a sketch element: a permutation of the original hashes
[Diagram: the shingle hashes of document 1 and document 2 are each permuted by π; the minimum under π becomes one sketch element of each document.]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?"
In this case, permutation π says d1 ≈ d2
Sec 19.6
Proof that J(S(di), S(dj)) ≅ P(xi = xj)
We view a matrix A: 1 column per set of hashes; element Aij = 1 if element i in set Sj is present
Permutation π(n) is a random reordering of the rows in A
xk is the first non-zero entry in π(dk), i.e., the first shingle present in document k
Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise
J(Si, Sj) is then equivalent to C11 / (C10 + C01 + C11)
P(xi = xj) then is equivalent to C11 / (C10 + C01 + C11)
[Example: a 0/1 matrix A over columns Si, Sj with J(Si, Sj) = 2/5; row permutations π(1), π(2), π(3), …, π(n=200) yield outcomes xi ≠ xj, xi ≠ xj, xi = xj, …, xi = xj.]
Sec 19.6
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200
k/n = k/200 estimates J(d1, d2)
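The whole pipeline can be sketched end-to-end (assumptions: MD5-based fingerprints, and random affine hash functions standing in for true random permutations, a common stand-in in MinHash implementations):

```python
import hashlib
import random

M = 2 ** 64                                  # fingerprint space [0, 2^64)

def fingerprint(shingle):
    """Map a shingle (tuple of words) to a 64-bit integer via MD5."""
    digest = hashlib.md5(" ".join(shingle).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def make_pi(seed):
    """An affine hash x -> (a*x + b) mod 2^64, approximating a random permutation."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, M, 2), rng.randrange(M)
    return lambda x: (a * x + b) % M

def sketch(fps, pis):
    """One minimum per permutation: a vector of len(pis) numbers."""
    return [min(pi(s) for s in fps) for pi in pis]

def shingles(text, n=2):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

pis = [make_pi(i) for i in range(200)]
f1 = {fingerprint(s) for s in shingles("Jack London traveled to Oakland")}
f2 = {fingerprint(s) for s in shingles("Jack London traveled to the city of Oakland")}
sk1, sk2 = sketch(f1, pis), sketch(f2, pis)
k = sum(x == y for x, y in zip(sk1, sk2))    # number of successful permutations
print(k / 200)                               # close to the true J(d1, d2) = 0.375
```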
Sec 19.6
Shingling: Summary
Input: N documents
Choose n-gram size for shingling, e.g., n = 5
Pick 200 random permutations, represented as hash functions
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document
Compute pairwise similarities
Take the transitive closure of documents with similarity > θ
Index only one document from each equivalence class
Sec 19.6
Summary
Search Advertising: a type of crowdsourcing; ask advertisers how much they want to spend, ask searchers how relevant an ad is; auction mechanics
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard by using random trials
EXAM MATTERS
Subscribe via NUS EduRec System by 20 May 2019
Get Exam Results earlier via SMS on 3 Jun 2019
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through slides) and corresponding sections in the textbook
For sample questions, refer to tutorials, the midterm and past year exam papers
If in doubt, ask on the forum
Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12
Q1 is true/false questions on various topics (20 marks)
Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each)
Don't forget your calculator
REVISION
Week 1: Ngram Models
A language model is created based on a collection of text, and used to assign a probability to a sequence of words
Unigram LM: bag of words
Ngram LM: use n-1 tokens as the context to predict the nth token
Larger n-gram models are more accurate, but each increase in order requires exponentially more space
LMs can be used to do information retrieval, as introduced in Week 11
Unigram LM with Add-1 smoothing
Not used in practice, but most basic to understand
Idea: add 1 count to all entries in the LM, including those that are not seen
e.g., assuming |V| = 11 (|V| = total # of word types in both lines):

Aerosmith line: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
LadyGaga line: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)

Q2 (By Probability): "I don't want"
P(Aerosmith): .11 × .11 × .11 = 1.3E-3
P(LadyGaga): .15 × .05 × .15 = 1.1E-3
Winner: Aerosmith
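The slide rounds each per-word probability first; computing exactly gives the same winner. A sketch (assuming the first count table is the Aerosmith line and the second the LadyGaga line):

```python
def add_one(counts, vocab_size, word):
    """Add-1 smoothed unigram probability: (c(w) + 1) / (N + |V|)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + 1) / (total + vocab_size)

aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}  # total 18
ladygaga  = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
             "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}  # total 20
V = 11                                   # word types across both lines

def seq_prob(counts, words):
    p = 1.0
    for w in words:
        p *= add_one(counts, V, w)
    return p

query = ["I", "don't", "want"]
print(seq_prob(aerosmith, query))   # (3/29)^3, about 1.1e-3
print(seq_prob(ladygaga, query))    # (4/31)(2/31)(4/31), slightly lower
```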
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings
Key characteristic: sorted order for postings
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size
Ch 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged
Split into Dictionary and Postings
Doc. frequency information is also stored
Why frequency?
Sec 1.2
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
[Example: Brutus: 2, 4, 8, 16, 32, 64, 128; Caesar: 1, 2, 3, 5, 8, 13, 21, 34; intersection: 2, 8]
If the list lengths are x and y, the merge takes O(x+y) operations
Crucial: postings must be sorted by docID
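The merge, as code (postings taken from the slide's Brutus/Caesar example):

```python
def intersect(p1, p2):
    """Walk both docID-sorted postings lists once: O(x + y)."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])    # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                  # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```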
Sec 1.3
Week 3: Postings lists and Choosing terms
Faster merging of posting lists: skip pointers
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming
Ch 2
Adding skip pointers to postings
Done at indexing time
Why? To skip postings that will not figure in the search results
How to do it? And where do we place skip pointers?
[Example lists with skips: 2, 4, 8, 41, 48, 64, 128 (skip pointers to 41 and 128); 1, 2, 3, 8, 11, 17, 21, 31 (skip pointers to 11 and 31)]
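A common answer to the placement question is √L evenly spaced skips; a sketch of a merge using them (skip positions are implicit here: every s-th index is treated as carrying a skip pointer):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersection of two docID-sorted postings lists with skip pointers
    placed every ~sqrt(L) positions (a common heuristic)."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow the skip pointer only while it does not overshoot
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 41, 48, 64, 128]
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect_with_skips(p1, p2))  # [2, 8]
```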
Sec 2.3
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words
For example, the text "Friends, Romans, Countrymen" would generate the biwords: friends romans, romans countrymen
Each of these biwords is now a dictionary term
Two-word phrase query-processing is now immediate
Sec 2.4.1
Positional index example
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>
For phrase queries, we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 2.4.2
Choosing terms
Definitely language specific
In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
Wildcard queries
How can we handle *'s in the middle of a query term? co*tion
We could look up co* AND *tion in a B-tree and intersect the two term sets: expensive
The solution: transform wild-card queries so that the * always occurs at the end
This gives rise to the Permuterm index
Alternative: K-gram index with postfiltering
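A sketch of the rotation trick behind the Permuterm index (the three-term vocabulary and helper names are illustrative):

```python
def rotations(term):
    """All rotations of term$ — each one is a permuterm index key for this term."""
    t = term + "$"
    return {t[i:] + t[:i] for i in range(len(t))}

def to_prefix(query):
    """Rotate a single-* wildcard query so the * falls at the end, then drop it."""
    q = query + "$"
    i = q.index("*")
    return q[i + 1:] + q[:i]        # co*tion -> tion$co

vocab = ["constitution", "caution", "cotton"]
prefix = to_prefix("co*tion")
print(prefix)  # tion$co
# a real permuterm index does a B-tree prefix lookup; a linear scan stands in here
print([t for t in vocab
       if any(r.startswith(prefix) for r in rotations(t))])  # ['constitution']
```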
Sec 3.2
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits
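Option 1 can be sketched with the standard dynamic program (a minimal Levenshtein implementation; uniform edit costs assumed):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

print(edit_distance("judgment", "judgement"))  # 1 (one insertion)
```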
Sec 3.3.2
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing; merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur)
Distributed indexing using MapReduce
Dynamic indexing: multiple indices, logarithmic merge
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID)
As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes
These are generated as we parse docs
Must now sort 100M such 8-byte records by termID
Define a Block as ~10M such records; can easily fit a couple into memory; will have 10 such blocks for our collection
Basic idea of algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order
Sec 4.2
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
DUPLICATEDETECTION
21
Sec 196
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Duplicate detectionThe web is full of duplicated content More so than many other collections
Exact duplicates Easy to detect use hashfingerprint (eg MD5)
Near-duplicates More common on the web difficult to eliminate
For the user itrsquos annoying to get a search result with near-identical documents
Marginal relevance is zero even a highly relevant document becomes nonrelevant if it appears below a (near-)duplicate
22
Sec 196
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
Documents as sketches
The number of shingles per document is large; it is difficult to exhaustively compare them. To make it fast, we use a sketch, a subset of the shingles of a document.
The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1, ..., π200. Each πi is a random permutation on [1..2^m].
The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), ..., min_{s∈d} π200(s) > (a vector of 200 numbers).
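In code, each "random permutation" is commonly simulated with a randomly parameterised hash function rather than an explicit permutation table. A toy sketch of the idea (our own names; n = 4 instead of 200 for brevity):

```python
import random

M = 2**31 - 1  # a Mersenne prime; simulated fingerprint space is [0, M)

def make_permutation(rng):
    """One random 'permutation' pi(s) = (a*s + b) mod M over fingerprints."""
    a, b = rng.randrange(1, M), rng.randrange(M)
    return lambda s: (a * s + b) % M

def sketch(fingerprints, perms):
    """Sketch of a document: min over its shingle fingerprints, per permutation."""
    return [min(pi(s) for s in fingerprints) for pi in perms]

rng = random.Random(42)
perms = [make_permutation(rng) for _ in range(4)]  # n = 4 toy sketch
doc = {101, 202, 303}                              # fingerprinted shingles
print(sketch(doc, perms))                          # a vector of 4 numbers
```

The same `perms` list must be shared across all documents, so that position i of every sketch was produced by the same πi.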
Deriving a sketch element: a permutation of the original hashes (applied to document 1 and to document 2).
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for: are d1 and d2 near-duplicates? In this case, permutation π says d1 ≈ d2.
Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
Let C00 = # of rows in A where both entries are 0; define C11, C10, and C01 likewise.
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11). P(x_i = x_j) is also equivalent to C11 / (C10 + C01 + C11).
[Figure: a 0/1 matrix A with one column per set (S_i, S_j), shown under permutations π(1), π(2), π(3), ..., π(n=200); under each permutation the outcome x_i ≠ x_j or x_i = x_j is marked. In this example, J(S_i, S_j) = 2/5.]
We view a matrix A with 1 column per set of hashes. Element A_ij = 1 if element i in set S_j is present. Permutation π(n) is a random reordering of the rows in A. x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k.
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s).
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial.
Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200). Our sketch is based on a random selection of permutations.
Thus, to estimate Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200: k/n = k/200 estimates J(d1, d2).
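A small simulation illustrates the estimator: with n = 200 hash-based "permutations" (an implementation assumption; the lecture leaves the permutation representation open), k/n lands close to the true Jaccard value:

```python
import random

def minhash_estimate(A, B, n=200, seed=0):
    """Estimate J(A, B) as the fraction of n random 'permutations' for
    which min over A equals min over B (one Bernoulli trial each)."""
    M = 2**31 - 1
    rng = random.Random(seed)
    k = 0
    for _ in range(n):
        a, b = rng.randrange(1, M), rng.randrange(M)
        pi = lambda s: (a * s + b) % M
        if min(pi(s) for s in A) == min(pi(s) for s in B):
            k += 1  # successful permutation
    return k / n

A = set(range(100))
B = set(range(50, 150))        # true J(A, B) = 50/150 = 1/3
print(minhash_estimate(A, B))  # close to 0.333
```

With n = 200, the standard error of the estimate is about sqrt(p(1−p)/200) ≈ 0.03 here, so individual runs scatter around 1/3 by a few hundredths.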
Shingling: Summary
Input: N documents. Choose the n-gram size for shingling, e.g., n = 5. Pick 200 random permutations, represented as hash functions. Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document.
Compute pairwise similarities; take the transitive closure of documents with similarity > θ. Index only one document from each equivalence class.
Summary
Search Advertising: a type of crowdsourcing. Ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics.
Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard coefficient by using random trials.
EXAM MATTERS
Get Exam Results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec System by 20 May 2019.
For more information: httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleasehtml
Exam Format
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through slides) and the corresponding sections in the textbook.
For sample questions, refer to tutorials, the midterm, and past year exam papers.
If in doubt, ask on the forum.
Exam Topic Distribution
Focus on topics covered from Weeks 5 to 12.
Q1 is true/false questions on various topics (20 marks).
Q2–Q9 are calculation / essay questions, topic specific (8–12 marks each). Don't forget your calculator!
REVISION
Week 1: Ngram Models
A language model is created based on a collection of text and used to assign a probability to a sequence of words.
Unigram LM: bag of words. Ngram LM: use n−1 tokens as the context to predict the nth token. Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
LMs can be used to do information retrieval, as introduced in Week 11.
Unigram LM with Add-1 smoothing
Not used in practice, but the most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.
e.g., assuming |V| = 11:
Q2 (By Probability): "I don't want"
P(Aerosmith): .11 × .11 × .11 = 1.3E-3
P(LadyGaga): .15 × .05 × .15 = 1.1E-3
Winner: Aerosmith
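As a cross-check, here is the computation in Python using the counts from the tables below. Note: with add-1 smoothing actually applied, both products shift slightly from the unsmoothed per-word decimals quoted above, but the winner is unchanged.

```python
def p(word, counts, total, vocab_size):
    """Add-1 (Laplace) smoothed unigram probability."""
    return (counts.get(word, 0) + 1) / (total + vocab_size)

aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
ladygaga = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}
V = 11  # total word types across both lines

def score(query, counts):
    """Product of smoothed unigram probabilities of the query words."""
    total = sum(counts.values())
    prob = 1.0
    for w in query.split():
        prob *= p(w, counts, total, V)
    return prob

q = "I don't want"
print(score(q, aerosmith))  # (3/29)^3  ≈ 1.11e-3
print(score(q, ladygaga))   # 32/31^3   ≈ 1.07e-3 -> Aerosmith still wins
```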
Aerosmith: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
LadyGaga: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)
|V| = 11 = total # of word types in both lines
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
Ch 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged. Split into Dictionary and Postings. Doc frequency information is also stored. Why frequency?
Sec 1.2
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries.
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Brutus AND Caesar: 2 → 8
If the list lengths are x and y, the merge takes O(x + y) operations. Crucial: postings must be sorted by docID.
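The merge can be written directly; a standard intersection sketch over sorted docID lists:

```python
def intersect(p1, p2):
    """Linear-time merge of two sorted postings lists: O(x + y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID in both lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```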
Sec 1.3
Week 3: Postings lists and Choosing terms
Faster merging of posting lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch 2
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
List 1: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers 2 → 41 and 41 → 128
List 2: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers 1 → 11 and 11 → 31
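One way to code the skip-aware merge; the skip-pointer layout here (a dict from list position to target position) is our own representational choice, not prescribed by the lecture:

```python
def intersect_with_skips(p1, p2, skip1, skip2):
    """Postings intersection where skip{1,2}[i] gives the index that a
    skip pointer at position i jumps to (absent = no skip)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # follow the skip on p1 only while it does not overshoot p2[j]
            if skip1.get(i) is not None and p1[skip1[i]] <= p2[j]:
                i = skip1[i]
            else:
                i += 1
        else:
            if skip2.get(j) is not None and p2[skip2[j]] <= p1[i]:
                j = skip2[j]
            else:
                j += 1
    return answer

p1 = [2, 4, 8, 41, 48, 64, 128]
p2 = [1, 2, 3, 8, 11, 17, 21, 31]
skip1 = {0: 3, 3: 6}   # 2 -> 41, 41 -> 128
skip2 = {0: 4, 4: 7}   # 1 -> 11, 11 -> 31
print(intersect_with_skips(p1, p2, skip1, skip2))  # [2, 8]
```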
Sec 2.3
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words. For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
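Generating the biword terms is a one-liner:

```python
def biwords(text):
    """Index every consecutive pair of terms as a single phrase term."""
    words = text.lower().split()
    return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

print(biwords("Friends Romans Countrymen"))
# ['friends romans', 'romans countrymen']
```

Longer phrase queries are then answered by ANDing their overlapping biwords (with possible false positives).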
Sec 2.4.1
Positional index example
<be: 993427; 1: <7, 18, 33, 72, 86, 231>; 2: <3, 149>; 4: <17, 191, 291, 430, 434>; 5: <363, 367>; ...>
For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 2.4.2
Choosing terms
Definitely language specific. In English, we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex
Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; and if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything.
Sec 3.1 (Not very tolerant!)
Wildcard queries
How can we handle *'s in the middle of a query term? E.g., co*tion.
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
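The permuterm trick in miniature: index every rotation of term + '$', then rotate a single-* query so the * lands at the end and do a prefix lookup (a sketch; function names are our own):

```python
def permuterm_rotations(term):
    """All rotations of term + '$' (end-of-term marker); each rotation
    becomes a dictionary key pointing back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_to_prefix(query):
    """Rotate a single-* query so * is at the end: co*tion -> tion$co."""
    left, right = query.split("*")
    return right + "$" + left  # then prefix-search the permuterm index

print(permuterm_rotations("hello"))
print(wildcard_to_prefix("co*tion"))  # 'tion$co'
```

Any indexed term whose rotation list contains a key starting with tion$co (i.e., co...tion) matches the query.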
Sec 3.2
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
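The classic dynamic-programming computation of Levenshtein distance, kept to two rolling rows:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(|a|*|b|) time."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/copy
        prev = cur
    return prev[-1]

print(edit_distance("judgment", "judgement"))  # 1 (one inserted 'e')
print(edit_distance("cat", "dog"))             # 3
```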
Sec 3.3.2
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks).
Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These records are generated as we parse docs.
Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.
Sec 4.2
SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
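A single SPIMI block pass can be sketched as follows (simplified: "memory full" is modelled as a dictionary-size cap, and the block is returned instead of written to disk):

```python
def spimi_invert(token_stream, memory_limit=1000):
    """One SPIMI pass: accumulate unsorted postings per term until the
    block is full, then sort the terms and emit the block."""
    dictionary = {}  # term -> postings list, built fresh for this block
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # postings arrive in docID order
        if len(dictionary) >= memory_limit:
            break  # block full: a real indexer writes it and starts anew
    return sorted(dictionary.items())  # terms sorted only at write time

stream = [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4)]
print(spimi_invert(stream))
# [('brutus', [1, 4]), ('caesar', [1, 2])]
```

Note how no (term, termID, docID) triples are sorted mid-stream: sorting happens once per block, when the block is emitted.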
Sec 4.3
Distributed Indexing: MapReduce data flow
[Figure: segment files feed Parsers in the Map phase; the parsers' a-b, c-d, ..., y-z partitions feed Inverters in the Reduce phase, which write the postings; a Master coordinates the assignment.]
Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this bit-vector. Periodically re-index into one main index.
Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, ...) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2, etc.
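The cascade above behaves like binary carry propagation. A toy sketch (here "disk" is just a Python list of levels, and every incoming batch plays the role of a full Z0):

```python
def logarithmic_add(z0, levels):
    """Spill an in-memory batch Z0 into on-disk levels I0, I1, ...
    Merging with an occupied level empties it and carries the merged
    index upward, so each posting takes part in O(log T) merges."""
    z = sorted(z0)
    i = 0
    while i < len(levels) and levels[i] is not None:
        z = sorted(levels[i] + z)   # merge Z_i with existing I_i -> Z_{i+1}
        levels[i] = None            # I_i is consumed
        i += 1
    if i == len(levels):
        levels.append(None)
    levels[i] = z                   # write Z as I_i
    return levels

levels = []
for batch in ([3, 1], [7, 5], [2, 8], [9, 4]):
    levels = logarithmic_add(batch, levels)
print(levels)  # [None, None, [1, 2, 3, 4, 5, 7, 8, 9]]
```

After four equal-sized batches, exactly level I2 (size 4× a batch) is occupied, mirroring the binary counter 100.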
(Loop for log levels)
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding. Postings compression: gap encoding.
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4. Need to store term lengths (1 extra byte each).
...7systile9syzygetic8syzygial6syzygy11szaibelyite...
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
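The byte arithmetic checks out in a few lines (assuming, as the slide does, 3-byte term pointers and 1-byte length fields):

```python
terms = ["systile", "syzygetic", "syzygial", "syzygy"]  # one block, k = 4

# Dictionary-as-a-string with blocking: one length byte precedes each
# term, but only the block's first term needs a pointer.
blocked_string = "".join(chr(len(t)) + t for t in terms)

k = 4
pointer_size = 3                        # assumed 3-byte term pointers
without_blocking = k * pointer_size     # one pointer per term: 12 bytes
with_blocking = 1 * pointer_size + k    # one pointer + k length bytes: 7
print(without_blocking - with_blocking)
# 5: save 9 bytes on 3 dropped pointers, lose 4 bytes on length fields
```

Within a block, the remaining terms are found by hopping over each length byte, which is why lookup inside a block becomes linear in k.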
Sec 5.2
CS3245 ndash Information Retrieval
Near-duplicates Example
23
Sec 196
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Detecting near-duplicates Compute similarity with an edit-distance measure We want ldquosyntacticrdquo (as opposed to semantic) similarity True semantic similarity (similarity in content) is too difficult
to compute
We do not consider documents near-duplicates if they have the same content but express it with different words
Use similarity threshold θ to make the call ldquoisisnrsquot a near-duplicaterdquo
Eg two documents are near-duplicates if similarity gt θ = 80
24
Sec 196
CS3245 ndash Information Retrieval
Recall Jaccard coefficient A commonly used measure of overlap of two sets Let A and B be two sets Jaccard coefficient
JACCARD(AA) = 1 JACCARD(AB) = 0 if A cap B = 0 A and B donrsquot have to be the same size Always assigns a number between 0 and 1
25
Sec 196
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add-1 smoothing: not used in practice, but most basic to understand. Idea: add 1 count to all entries in the LM, including those that are not seen.
E.g., assuming |V| = 11:
Q2 (by probability): "I don't want"
P(Aerosmith) = (3/29) × (3/29) × (3/29) ≈ 1.11E−3
P(LadyGaga) = (4/31) × (2/31) × (4/31) ≈ 1.07E−3
Winner: Aerosmith
Aerosmith counts: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
LadyGaga counts: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)
|V| = total # of word types in both lines = 11
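The Q2 computation above can be reproduced with a short sketch (the function and variable names are my own, and I assume the first count table is the Aerosmith line and the second the Lady Gaga line):

```python
def add1_prob(query, counts, total, vocab_size):
    """Add-1 (Laplace) smoothed unigram probability:
    P(query) = product of (count(w) + 1) / (total + |V|)."""
    p = 1.0
    for w in query.split():
        p *= (counts.get(w, 0) + 1) / (total + vocab_size)
    return p

aerosmith = {"i": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}  # total 18
ladygaga = {"i": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20

p_a = add1_prob("i don't want", aerosmith, 18, 11)  # (3/29)^3 ≈ 1.11e-3
p_l = add1_prob("i don't want", ladygaga, 20, 11)   # (4/31)(2/31)(4/31) ≈ 1.07e-3
print(p_a > p_l)  # Aerosmith wins, as on the slide
```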
Week 2: Basic (Boolean) IR. Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings. Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
Ch 1
Indexer steps, Dictionary & Postings: multiple term entries in a single document are merged; split into Dictionary and Postings; doc frequency information is also stored. Why frequency?
Sec 1.2
The merge: walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Result: 2 → 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
Sec 1.3
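A minimal sketch of the linear-time merge (the helper name `intersect` is my own):

```python
def intersect(p1, p2):
    """Linear-time merge of two sorted postings lists: O(x + y)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```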
Week 3: Postings lists and choosing terms. Faster merging of postings lists: skip pointers. Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries. Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch 2
Adding skip pointers to postings: done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.]
Sec 2.3
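One way the skip-pointer idea might look in code, assuming a skip pointer sits at every `skip`-th position (the placement heuristic and names are illustrative, not the lecture's exact scheme):

```python
def intersect_with_skips(p1, p2, skip=3):
    """Postings merge that follows a skip pointer when the skipped-to
    docID does not overshoot the current docID in the other list."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip only if it lands on a docID still <= p2[j]
            if i % skip == 0 and i + skip < len(p1) and p1[i + skip] <= p2[j]:
                i += skip
            else:
                i += 1
        else:
            if j % skip == 0 and j + skip < len(p2) and p2[j + skip] <= p1[i]:
                j += skip
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```

The skip is safe because every posting jumped over is strictly smaller than the skipped-to posting, which is itself no larger than the comparison target.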
Biword indexes: index every consecutive pair of terms in the text as a phrase (a bigram model over words). For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen". Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
Sec 2.4.1
Positional index example: for phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.
<be: 993427; doc 1: 7, 18, 33, 72, 86, 231; doc 2: 3, 149; doc 4: 17, 191, 291, 430, 434; doc 5: 363, 367; …>
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 2.4.2
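For the quick check, the "be" postings alone give a necessary condition: in "to be or not to be" the two occurrences of "be" are exactly 4 positions apart. A sketch (names are my own; a full phrase match would also need the "to", "or" and "not" postings):

```python
be_postings = {1: [7, 18, 33, 72, 86, 231],
               2: [3, 149],
               4: [17, 191, 291, 430, 434],
               5: [363, 367]}

def candidate_docs(postings, gap=4):
    """Docs where some pair of positions is exactly `gap` apart."""
    out = []
    for doc, positions in sorted(postings.items()):
        pos = set(positions)
        if any(p + gap in pos for p in positions):
            out.append(doc)
    return out

print(candidate_docs(be_postings))  # [4, 5]: (430, 434) and (363, 367)
```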
Choosing terms: definitely language specific. In English we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval. Data structures for the dictionary: hash, trees. Learning to be tolerant:
1. Wildcards: general trees, Permuterm, N-grams redux
2. Spelling correction: edit distance, N-grams re-redux
3. Phonetic: Soundex
Hash table: each vocabulary term is hashed to an integer. Pros: lookup is faster than for a tree, O(1). Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything. Not very tolerant.
Sec 3.1
Wildcard queries: how can we handle *'s in the middle of a query term, e.g. co*tion? We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive! The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering.
Sec 3.2
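A small sketch of the permuterm idea for a single "*" (helper names are my own):

```python
def permuterm_keys(term):
    """All rotations of term + '$' (the end marker); each rotation is
    indexed so any X*Y query becomes a prefix lookup."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def match(query, vocabulary):
    """Handle X*Y by rotating the query to Y$X* and prefix-matching
    the stored rotations (single '*' assumed)."""
    head, tail = query.split("*", 1)
    prefix = tail + "$" + head
    return sorted(w for w in vocabulary
                  if any(rot.startswith(prefix) for rot in permuterm_keys(w)))

vocab = ["coalition", "cotton", "motion", "concoction", "coercion"]
print(match("co*tion", vocab))  # ['coalition', 'concoction']
```

A real permuterm index stores the rotations in a B-tree so the prefix scan is a range lookup rather than a linear pass over the vocabulary.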
Spelling correction. Isolated-word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
Sec 3.3.2
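The Levenshtein distance can be computed with the standard dynamic-programming table; a sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                 # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                 # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / copy
    return d[m][n]

print(edit_distance("judgment", "judgement"))  # 1
```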
Week 5: Index construction. Sort-based indexing: Blocked Sort-Based Indexing; merge sort is effective for disk-based sorting (avoids seeks). Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur). Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked sort-based indexing (sorting with fewer disk seeks). 8-byte (4+4) records (termID, docID); as terms are of variable length, create a dictionary to map terms to 4-byte termIDs. These records are generated as we parse docs. Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection. Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec 4.2
SPIMI: Single-pass in-memory indexing. Key idea 1: generate separate dictionaries for each block – no need to maintain a term-to-termID mapping across blocks. Key idea 2: don't sort; accumulate postings in postings lists as they occur. With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
Sec 4.3
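A toy sketch of the SPIMI inversion step for one in-memory block (names are illustrative; real SPIMI works on disk-sized blocks and writes each sorted block out):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """One SPIMI block: accumulate postings per term as tokens arrive,
    with no sorting during indexing; sort terms only once, at write-out."""
    index = defaultdict(list)
    for term, doc_id in token_stream:
        postings = index[term]              # new terms get a fresh list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)         # docIDs arrive in order: append
    return {term: index[term] for term in sorted(index)}

stream = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("noble", 2)]
print(spimi_invert(stream))
```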
Distributed Indexing, MapReduce data flow:
[Figure: a Master coordinates Parsers and Inverters. In the map phase, Parsers read splits of the input and write segment files partitioned into term ranges (a-b, c-d, …, y-z). In the reduce phase, one Inverter per term range collects its segment files and writes out the postings.]
Dynamic Indexing, 2nd simplest approach: maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both, merging results. Deletions: an invalidation bit-vector for deleted docs; filter docs out of a search result by this invalidation bit-vector. Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
Logarithmic merge. Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk. If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1. Either write merge Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2, and so on, looping over the log levels.
Sec 4.5
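A toy sketch of the cascade (illustrative only; real implementations merge on-disk indexes, not Python lists):

```python
def log_merge_insert(indexes, z0, n):
    """Flush the in-memory index z0 down the levels once it holds >= n
    postings. Level i is either empty (None) or an index of size ~ n * 2^i."""
    if len(z0) < n:
        return indexes, z0            # still room in memory: nothing to do
    level, carry = 0, sorted(z0)
    while level < len(indexes) and indexes[level] is not None:
        # occupied level: merge it into the carry and keep cascading down
        carry = sorted(set(indexes[level]) | set(carry))
        indexes[level] = None
        level += 1
    if level == len(indexes):
        indexes.append(None)          # grow a new, larger level
    indexes[level] = carry
    return indexes, []                # memory index is empty again

indexes, z0 = log_merge_insert([], [3, 1], n=2)
indexes, z0 = log_merge_insert(indexes, [2, 4], n=2)
print(indexes)  # [None, [1, 2, 3, 4]]
```

Because each posting only moves when its level fills up and levels double in size, a posting is merged O(log levels) times rather than once per flush.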
Week 6: Index compression. Collection and vocabulary statistics: Heaps' and Zipf's laws. Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding. Postings compression: gap encoding.
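Gap encoding can be sketched as follows; for illustration I also add variable-byte encoding of the gaps, which is covered in the same chapter (function names are my own):

```python
def to_gaps(postings):
    """Store the first docID, then the gaps between consecutive docIDs.
    Gaps are small for frequent terms, so they compress well."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def vb_encode(n):
    """Variable-byte encode one gap: 7 payload bits per byte,
    high bit set only on the final (least significant) byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the last byte
    return bytes(out)

gaps = to_gaps([283042, 283154, 283159])
print(gaps)                          # [283042, 112, 5]
print([vb_encode(g) for g in gaps])  # 3 bytes, 1 byte, 1 byte
```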
Index compression, Dictionary-as-a-String with blocking: store pointers to every kth term string (example below, k = 4); need to store term lengths (1 extra byte).
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec 5.2
Recall the Jaccard coefficient, a commonly used measure of the overlap of two sets. Let A and B be two sets; the Jaccard coefficient is JACCARD(A, B) = |A ∩ B| / |A ∪ B|. JACCARD(A, A) = 1; JACCARD(A, B) = 0 if A ∩ B = ∅. A and B don't have to be the same size. Always assigns a number between 0 and 1.
Sec 19.6
Jaccard coefficient, example. Three documents:
d1: "Jack London traveled to Oakland"
d2: "Jack London traveled to the city of Oakland"
d3: "Jack traveled from Oakland to London"
Based on shingles of size 2 (2-grams, or bigrams), what are the Jaccard coefficients J(d1, d2) and J(d1, d3)?
J(d1, d2) = 3/8 = 0.375; J(d1, d3) = 0. Note: very sensitive to dissimilarity.
Sec 19.6
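A sketch reproducing the example's numbers (helper names are my own):

```python
def shingles(text, n=2):
    """The set of word n-grams of a text (bigrams here, as in the example)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

d1 = shingles("Jack London traveled to Oakland")
d2 = shingles("Jack London traveled to the city of Oakland")
d3 = shingles("Jack traveled from Oakland to London")
print(jaccard(d1, d2), jaccard(d1, d3))  # 0.375 0.0
```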
A document as a set of shingles: a shingle is simply a word n-gram. Shingles are used as features to measure the syntactic similarity of documents. For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: {a-rose-is, rose-is-a, is-a-rose}. We define the similarity of two documents as the Jaccard coefficient of their shingle sets.
Sec 19.6
Fingerprinting: we can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting. We use s_k to refer to the shingle's fingerprint in 1..2^m. This doesn't directly help us; we are just converting strings to large integers. But it will help us compute an approximation to the actual Jaccard coefficient quickly.
Sec 19.6
Documents as sketches: the number of shingles per document is large, making documents difficult to compare exhaustively. To make it fast, we use a sketch, a subset of the shingles of a document. The size of a sketch is, say, n = 200, and it is defined by a set of permutations π1, …, π200; each πi is a random permutation on 1..2^m. The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers).
Sec 19.6
Deriving a sketch element: apply a permutation to the original hashes of document 1 and document 2, keeping the minimum of each. We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for: are d1 and d2 near-duplicates? In this case, permutation π says d1 ≈ d2.
Sec 19.6
Proof that J(S(di), S(dj)) ≅ P(xi = xj):
We view a matrix A with one column per set of hashes; element Aij = 1 if element i is present in set Sj. A permutation π(n) is a random reordering of the rows of A, and xk is the first non-zero entry in π's ordering of column dk, i.e., the first shingle present in document k.
Let C00 = # of rows in A where both entries are 0; define C11, C10 and C01 likewise. J(Si, Sj) is then equivalent to C11 / (C10 + C01 + C11), and P(xi = xj) is also equivalent to C11 / (C10 + C01 + C11), since rows where both entries are 0 never determine the first non-zero entry.
[Figure: an example matrix A for Si and Sj, with permutations π(1), π(2), π(3), …, π(n=200) yielding xi ≠ xj, xi ≠ xj, xi = xj, xi = xj; in this example J(Si, Sj) = 2/5.]
Estimating Jaccard: thus, the proportion of successful permutations is the Jaccard coefficient. Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s). Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial. Estimator of the probability of success: the proportion of successes in n Bernoulli trials (n = 200). Our sketch is based on a random selection of permutations. Thus, to estimate Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200: k/n = k/200 estimates J(d1, d2).
Sec 19.6
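The whole estimate can be sketched end-to-end with explicit random permutations (the toy universe size and sets are my own; real systems use hash functions in place of explicit permutations over 2^64 values):

```python
import random

def sketch(shingle_ids, perms):
    """Min-hash sketch: for each permutation, keep the minimum permuted value."""
    return [min(p[s] for s in shingle_ids) for p in perms]

def estimate_jaccard(sk1, sk2):
    """Fraction of 'successful' permutations (matching minima) estimates J."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

m = 1000                      # toy universe of fingerprints 0..m-1
random.seed(42)
perms = []
for _ in range(200):          # n = 200 random permutations
    p = list(range(m))
    random.shuffle(p)
    perms.append(p)

a = set(range(0, 60))         # two overlapping shingle-ID sets:
b = set(range(30, 90))        # true Jaccard = 30/90 = 1/3
est = estimate_jaccard(sketch(a, perms), sketch(b, perms))
print(round(est, 2))          # estimate of J(a, b); true value is 1/3
```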
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Jaccard coefficient Example Three documents d1 ldquoJack London traveled to Oaklandrdquo
d2 ldquoJack London traveled to the city of Oaklandrdquo
d3 ldquoJack traveled from Oakland to Londonrdquo
Based on shingles of size 2 (2-grams or bigrams) what are the Jaccard coefficients J(d1 d2) and J(d1 d3)
J(d1 d2) = 38 = 0375 J(d1 d3) = 0 Note very sensitive to dissimilarity
26
Sec 196
CS3245 ndash Information Retrieval
A document as set of shingles A shingle is simply a word n-gram Shingles are used as features to measure syntactic
similarity of documents For example for n = 3 ldquoa rose is a rose is a roserdquo
would be represented as this set of shingles a-rose-is rose-is-a is-a-rose
We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
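The two key ideas translate almost directly into code. A minimal sketch (illustrative names; real SPIMI writes each block to disk when memory is full rather than after a fixed token count):

```python
def spimi_invert(token_stream, block_size):
    """SPIMI sketch: a fresh dictionary per block, postings appended
    in occurrence order (no sorting of postings, no global termID map)."""
    blocks, block = [], {}
    for i, (term, doc_id) in enumerate(token_stream, 1):
        block.setdefault(term, []).append(doc_id)  # accumulate as they occur
        if i % block_size == 0:
            blocks.append(dict(sorted(block.items())))  # sort terms at block write
            block = {}
    if block:
        blocks.append(dict(sorted(block.items())))
    return blocks

def merge_blocks(blocks):
    """Merge the per-block indices into one big index."""
    index = {}
    for blk in blocks:
        for term, postings in blk.items():
            index.setdefault(term, []).extend(postings)
    return index

index = merge_blocks(spimi_invert(
    [("brutus", 1), ("caesar", 1), ("caesar", 2), ("brutus", 4)], block_size=2))
```

Since documents are processed in docID order, each term's postings are already sorted when the blocks are concatenated during the merge.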
Distributed Indexing: MapReduce data flow
58
[Diagram: the input is split into segment files; a Master assigns splits to Parsers (map phase), whose output is partitioned into term ranges a-b, c-d, …, y-z; Inverters (reduce phase) each collect one term range and write out its postings.]
Dynamic Indexing: 2nd simplest approach
- Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index
- Search across both, merge results
- Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this bit-vector
- Periodically re-index into one main index
- Assuming T = total number of postings and n = size of the auxiliary index, we touch each posting up to floor(T/n) times
- Alternatively, we can apply logarithmic merge, in which each posting is merged up to log(T/n) times
Sec 45
59
Logarithmic merge
- Idea: maintain a series of indexes, each twice as large as the previous one
- Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk
- If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1
- Either write the merged Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2
- … etc.
Sec 45
60
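The cascade of merges works like carry propagation in binary addition. A minimal simulation (simplified bookkeeping; indexes are plain sorted lists of docIDs, and `None` marks an absent I_i):

```python
def log_merge_add(doc, z0, disk, n):
    """Logarithmic merge sketch: z0 is the in-memory index Z0,
    disk[i] holds on-disk index I_i (or None if it does not exist)."""
    z0 = z0 + [doc]
    if len(z0) <= n:              # Z0 still fits in memory
        return z0, disk
    z, i = z0, 0
    while True:                   # carry propagation over the levels
        if i == len(disk):
            disk.append(None)
        if disk[i] is None:
            disk[i] = z           # write Z_i to disk as I_i
            return [], disk
        z = sorted(disk[i] + z)   # merge with I_i to form Z_{i+1}
        disk[i] = None
        i += 1

z0, disk = [], []
for d in range(1, 6):             # add five documents, with n = 1
    z0, disk = log_merge_add(d, z0, disk, n=1)
```

After five insertions with n = 1, documents 1-4 have cascaded into I_1 (I_0 is empty) and document 5 sits in the in-memory Z0 — each posting has been merged only a logarithmic number of times.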
Week 6: Index Compression
- Collection and vocabulary statistics: Heaps' and Zipf's laws
- Compression to make the index smaller, faster
- Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
- Postings compression: gap encoding
61
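Gap encoding is a one-liner in each direction: store the first docID and then the differences between consecutive docIDs, which are small for frequent terms and so compress well under a variable-length code. A minimal sketch (the docIDs here are arbitrary illustration values):

```python
def to_gaps(postings):
    """First docID, then gaps between consecutive docIDs."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Recover docIDs by a running sum over the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

gaps = to_gaps([283042, 283043, 283044, 283047])
```

The run of large docIDs becomes `[283042, 1, 1, 3]` — mostly tiny integers, which is exactly what codes like variable-byte or gamma exploit.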
Index Compression: Dictionary-as-a-String and Blocking
- Store pointers to every k-th term's string; example below: k = 4
- Need to store term lengths (1 extra byte)
62
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec 52
A document as a set of shingles
- A shingle is simply a word n-gram
- Shingles are used as features to measure syntactic similarity of documents
- For example, for n = 3, "a rose is a rose is a rose" would be represented as this set of shingles: { a-rose-is, rose-is-a, is-a-rose }
- We define the similarity of two documents as the Jaccard coefficient of their shingle sets
27
Sec 196
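The shingle set and the Jaccard coefficient are both a couple of lines of Python, reproducing the rose example above (names illustrative):

```python
def shingles(text, n=3):
    """The set of word n-grams (shingles) of a text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient of two shingle sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

s = shingles("a rose is a rose is a rose")
```

Because `s` is a set, the repeated shingle "a rose is" collapses to one element, leaving exactly the three shingles from the slide.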
Fingerprinting
- We can map shingles to a large integer space [1..2^m] (e.g., m = 64) by fingerprinting
- We use s_k to refer to the shingle's fingerprint in 1..2^m
- This doesn't directly help us – we are just converting strings to large integers
- But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
Documents as sketches
- The number of shingles per document is large; difficult to exhaustively compare
- To make it fast, we use a sketch: a subset of the shingles of a document
- The size of a sketch is, say, n = 200, and is defined by a set of permutations π1 … π200
- Each πi is a random permutation on 1..2^m
- The sketch of d is defined as < min_{s∈d} π1(s), min_{s∈d} π2(s), …, min_{s∈d} π200(s) > (a vector of 200 numbers)
29
Sec 196
30
Deriving a sketch element: a permutation of the original hashes (document 1 vs. document 2)
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2.
Proof that J(S(di), S(dj)) ≅ P(xi = xj)
- Let C00 = number of rows in A where both entries are 0; define C11, C10, C01 likewise
- J(si, sj) is then equivalent to C11 / (C10 + C01 + C11)
- P(xi = xj) then is equivalent to C11 / (C10 + C01 + C11)
31
[Figure: a 0/1 matrix A with one column per set of hashes (Si, Sj); element A_ij = 1 if element i of set S_j is present. In the example, J(si, sj) = 2/5. Each permutation π(1), π(2), π(3), …, π(n=200) is a random reordering of the rows of A; x_k is the first non-zero entry in π(d_k), i.e., the first shingle present in document k. The permuted copies shown give the outcomes xi ≠ xj, xi ≠ xj, xi = xj, xi = xj.]
Sec 196
Estimating Jaccard
- Thus, the proportion of successful permutations is the Jaccard coefficient
  (Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s))
- Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
- Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)
- Our sketch is based on a random selection of permutations
- Thus, to compute Jaccard, count the number k of successful permutations for < d1, d2 > and divide by n = 200
- k/n = k/200 estimates J(d1, d2)
32
Sec 196
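The whole sketch-and-estimate pipeline fits in a short script. This toy uses a small universe (2^10 instead of 2^64) so the 200 permutations can be stored as explicit lists; real implementations use hash functions instead, and all names are illustrative:

```python
import random

def sketch(fingerprints, perms):
    """MinHash sketch: for each permutation, keep the minimum
    permuted value over the document's shingle fingerprints."""
    return [min(p[s] for s in fingerprints) for p in perms]

def jaccard_estimate(sk1, sk2):
    """Fraction of 'successful' permutations (equal minima): k / n."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)

m = 2 ** 10                      # small universe for the demo
rng = random.Random(42)
perms = []
for _ in range(200):             # n = 200 random permutations of 0..m-1
    p = list(range(m))
    rng.shuffle(p)
    perms.append(p)

d1 = {1, 2, 3, 4, 5}
d2 = {3, 4, 5, 6, 7}             # true Jaccard = 3/7 ≈ 0.43
est = jaccard_estimate(sketch(d1, perms), sketch(d2, perms))
```

With 200 trials the standard error is about sqrt(p(1-p)/200) ≈ 0.035 here, so the estimate lands near 0.43 while each document is summarized by only 200 numbers.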
Shingling: Summary
- Input: N documents
- Choose an n-gram size for shingling, e.g., n = 5
- Pick 200 random permutations, represented as hash functions
- Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document
- Compute pairwise similarities
- Take the transitive closure of documents with similarity > θ
- Index only one document from each equivalence class
33
Sec 196
Summary
- Search Advertising: a type of crowdsourcing – ask advertisers how much they want to spend, ask searchers how relevant an ad is; auction mechanics
- Duplicate Detection: represent documents as shingles; calculate an approximation to the Jaccard by using random trials
34
EXAM MATTERS
35
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleasehtml
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
Exam Topic Distribution
- Focus on topics covered from Weeks 5 to 12
- Q1 is true/false questions on various topics (20 marks)
- Q2-Q9 are calculation / essay questions, topic specific (8~12 marks each). Don't forget your calculator!
39
40
REVISION
Week 1: Ngram Models
- A language model is created based on a collection of text, and used to assign a probability to a sequence of words
- Unigram LM: bag of words
- Ngram LM: use n-1 tokens as the context to predict the nth token
- Larger n-gram models are more accurate, but each increase in order requires exponentially more space
- LMs can be used to do information retrieval, as introduced in Week 11
41
Unigram LM with Add-1 smoothing
- Not used in practice, but most basic to understand
- Idea: add 1 count to all entries in the LM, including those that are not seen
- e.g., assuming |V| = 11:

Q2 (By Probability): "I don't want"
P(Aerosmith) = .11 × .11 × .11 = 1.3E-3
P(LadyGaga) = .15 × .05 × .15 = 1.1E-3
Winner: Aerosmith
42

Smoothed counts, Aerosmith (Total Count 18): I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1
Smoothed counts, LadyGaga (Total Count 20): I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2
|V| = total # of word types in both lines
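The count tables can be regenerated from the two underlying lyric lines (the lines below are reconstructed so that their add-1 counts match the tables; the slide's decimals are rounded, but the winner comes out the same):

```python
def add_one_lm(line, vocab):
    """Add-1 smoothed unigram LM: every word in the vocabulary gets one
    extra count, so unseen words still have nonzero probability."""
    counts = {w: 1 for w in vocab}          # the "+1" for every entry
    for w in line.split():
        counts[w] += 1
    total = sum(counts.values())            # raw token count + |V|
    return {w: c / total for w, c in counts.items()}

def prob(lm, query):
    """Unigram probability of a query: product of per-token probabilities."""
    p = 1.0
    for w in query.split():
        p *= lm[w]
    return p

aero_line = "I don't want to close my eyes"
gaga_line = "I want your love and I want your revenge"
vocab = set(aero_line.split()) | set(gaga_line.split())   # |V| = 11
p_aero = prob(add_one_lm(aero_line, vocab), "I don't want")
p_gaga = prob(add_one_lm(gaga_line, vocab), "I don't want")
```

This reproduces the slide's comparison: (2/18)³ ≈ 1.4E-3 for Aerosmith versus (3/20)(1/20)(3/20) ≈ 1.1E-3 for LadyGaga, so Aerosmith wins.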
Week 2: Basic (Boolean) IR
- Basic inverted indexes: in-memory dictionary and on-disk postings
- Key characteristic: sorted order for postings
- Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size
43
Ch 1
Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged
- Split into Dictionary and Postings
- Doc frequency information is also stored. Why frequency?
Sec 12
44
The merge
- Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
Sec 13
45
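The two-pointer merge above is short enough to write out in full; this sketch returns the intersection of two docID-sorted postings lists, reproducing the Brutus/Caesar example:

```python
def intersect(p1, p2):
    """Walk the two sorted postings lists simultaneously: O(x + y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer on the smaller docID
        else:
            j += 1
    return answer

hits = intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34])
```

The crucial precondition is visible in the code: only because both lists are sorted by docID can a single forward pass find every common document.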
Week 3: Postings lists and Choosing terms
- Faster merging of posting lists: skip pointers
- Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries
- Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming
46
Ch 2
Adding skip pointers to postings
- Done at indexing time. Why? To skip postings that will not figure in the search results
- How to do it? And where do we place skip pointers?
47
[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers 2 → 41 and 41 → 128; list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers 1 → 11 and 11 → 31]
Sec 23
Biword indexes
- Index every consecutive pair of terms in the text as a phrase: a bigram model using words
- For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen"
- Each of these biwords is now a dictionary term
- Two-word phrase query-processing is now immediate
48
Sec 241
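Generating the biword dictionary terms is a one-liner; a sketch reproducing the example above (punctuation handling omitted for brevity):

```python
def biwords(text):
    """Index every consecutive pair of terms as a phrase term."""
    terms = text.lower().split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

pairs = biwords("Friends Romans Countrymen")
```

A two-word phrase query such as "romans countrymen" is then an ordinary single-term dictionary lookup against these biword terms.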
Positional index example

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

- For phrase queries, we use a merge algorithm recursively, at the document level
- Now need to deal with more than just equality
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
49
Sec 242
Choosing terms
- Definitely language specific
- In English we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming
50
Week 4: The dictionary and tolerant retrieval
Data Structures for the Dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, Ngrams (redux)
2. Spelling Correction: edit distance, Ngrams (re-redux)
3. Phonetic: Soundex
51
Hash Table
- Each vocabulary term is hashed to an integer
- Pros: lookup is faster than for a tree: O(1)
- Cons:
  - No easy way to find minor variants: judgment/judgement
  - No prefix search
  - If the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
52
Sec 31
Not very tolerant
Wildcard queries
- How can we handle *'s in the middle of a query term? E.g., co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
- The solution: transform wild-card queries so that the * always occurs at the end
- This gives rise to the Permuterm index
- Alternative: K-gram index with postfiltering
53
Sec 32
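The rotation trick is easy to demonstrate. A minimal sketch (illustrative names; handles a single `*` per query, and the real dictionary would be a B-tree supporting prefix lookups):

```python
def permuterm_keys(term):
    """All rotations of term + '$'; each rotation is a dictionary key
    pointing back to the original term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_to_key(query):
    """Rotate a single-* wildcard query so the * lands at the end;
    the result is used as a prefix lookup in the permuterm index."""
    t = query + "$"
    star = t.index("*")
    return t[star + 1:] + t[:star]   # suffix$prefix, with the * dropped

key = wildcard_to_key("co*tion")     # -> "tion$co"
```

Any term whose permuterm keys have `"tion$co"` as a prefix (e.g., "coagulation", via its rotation "tion$coagula") matches co*tion.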
Spelling correction
- Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
- How do we define "closest"? We studied several alternatives:
  1. Edit distance (Levenshtein distance)
  2. Weighted edit distance
  3. n-gram overlap
- Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits
54
Sec 332
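Alternative 1 above, the (unweighted) Levenshtein distance, is the standard two-row dynamic program; a compact sketch:

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and
    substitutions needed to turn a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete ca
                           cur[j - 1] + 1,          # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (0 if equal)
        prev = cur
    return prev[-1]

d = levenshtein("graffe", "giraffe")
```

An isolated-word corrector would compute this distance from the query string to candidate lexicon words and return the closest ones; the weighted variant simply replaces the constant 1s with per-character edit costs.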
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Fingerprinting We can map shingles to a large integer space
[12m] (eg m = 64) by fingerprinting We use sk to refer to the shinglersquos fingerprint in 12m
This doesnrsquot directly help us ndash we are just converting strings to large integers
But it will help us compute an approximation to the actual Jaccard coefficient quickly
28
Sec 196
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Documents as sketches The number of shingles per document is large
difficult to exhaustively compare To make it fast we use a sketch a subset of the
shingles of a document The size of a sketch is say n = 200 and is defined by a
set of permutations π1 π200 Each πi is a random permutation on 12m
The sketch of d is defined as lt minsisind π1(s)minsisind π2(s) minsisind π200(s) gt
(a vector of 200 numbers)
29
Sec 196
CS3245 ndash Information Retrieval
30
Deriving a sketch element a permutation of the original hashes
document 1 sk document 2 sk
We use minsisind1 π(s) = minsisind2 π(s) as a test for are d1 and d2near-duplicates In this case permutation π says d1 asymp d2
Sec 196
30
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection (Brutus AND Caesar): 2 → 8
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
Sec 13
45
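The two-pointer merge above can be sketched directly; the function name `intersect` is mine, and the lists are the Brutus/Caesar postings from the figure:

```python
def intersect(p1, p2):
    """Linear-time merge of two docID-sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])      # docID in both lists
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            i += 1                    # advance the smaller pointer
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```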
Week 3 Postings lists and Choosing terms
Faster merging of posting lists: Skip pointers
Handling of phrase and proximity queries: Biword indexes for phrase queries; Positional indexes for phrase/proximity queries
Steps in choosing terms for the dictionary: Text extraction, Granularity of indexing, Tokenization, Stop word removal, Normalization, Lemmatization and stemming
46
Ch 2
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
47
List 1: 2 → 4 → 8 → 41 → 48 → 64 → 128, with skip pointers 2 → 41 → 128
List 2: 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31, with skip pointers 1 → 11 → 31
Sec 23
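A sketch of intersection with skip pointers, answering the "where to place them" question with the common √L heuristic (skips every √L entries, kept implicit as a stride rather than stored pointers); names and structure are my own:

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect sorted postings lists, using implicit skip pointers
    placed every sqrt(L) entries (a common placement heuristic)."""
    s1 = max(1, int(math.sqrt(len(p1))))
    s2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j].
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```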
Biword indexes
Index every consecutive pair of terms in the text as a phrase (bigram model using words)
For example, the text "Friends, Romans, Countrymen" would generate the biwords: friends romans, romans countrymen
Each of these biwords is now a dictionary term
Two-word phrase query-processing is now immediate
48
Sec 241
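A minimal biword indexer over the slide's example; the helper name `index_biwords` and the dict-of-docs input shape are my own:

```python
from collections import defaultdict

def index_biwords(docs):
    """Biword index: every consecutive pair of terms becomes a dictionary
    term, mapped to a sorted postings list of docIDs."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        tokens = text.lower().split()
        for pair in zip(tokens, tokens[1:]):
            if not index[pair] or index[pair][-1] != doc_id:
                index[pair].append(doc_id)   # avoid duplicate docIDs
    return index

idx = index_biwords({1: "friends romans countrymen"})
print(idx[("friends", "romans")])      # [1]
print(idx[("romans", "countrymen")])   # [1]
```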
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …>
Quick check: Which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 242
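The "more than just equality" check is an adjacency test on position lists. A toy sketch: the positions of "be" in doc 4 come from the posting list above, while the positions of "to" are invented for illustration (chosen so doc 4 contains "to be … to be"):

```python
def follows(pos1, pos2):
    """True iff term2 occurs at a position exactly one past term1,
    i.e. term1 is immediately followed by term2 in the document."""
    later = set(pos2)
    return any(p + 1 in later for p in pos1)

# Positions of "be" in doc 4, from the positional posting above.
be_doc4 = [17, 191, 291, 430, 434]
# Hypothetical positions of "to" in doc 4 (not given on the slide).
to_doc4 = [429, 433]
print(follows(to_doc4, be_doc4))   # True: "to be" at 429-430 and 433-434
```

For a k-word phrase, the same test is applied pairwise with offset k instead of 1, which is why the slide calls it a recursive merge.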
Choosing terms
Definitely language specific
In English we worry about: Tokenization, Stop word removal, Normalization (e.g. case folding), Lemmatization and stemming
50
Data Structures for the Dictionary: Hash, Trees
Learning to be tolerant:
1. Wildcards: General, Trees, Permuterm, Ngrams redux
2. Spelling Correction: Edit Distance, Ngrams re-redux
3. Phonetic – Soundex
51
Week 4 The dictionary and tolerant retrieval
Hash Table
Each vocabulary term is hashed to an integer
Pros: Lookup is faster than for a tree: O(1)
Cons: No easy way to find minor variants (judgment/judgement); no prefix search; if vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
52
Sec 31
Not very tolerant
Wildcard queries
How can we handle *'s in the middle of a query term: co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wild-card queries so that the * always occurs at the end
This gives rise to the Permuterm index. Alternative: K-gram index with postfiltering
53
Sec 32
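A permuterm sketch: every rotation of term+'$' becomes a dictionary key, and a single-'*' query X*Y is rotated into the prefix Y$X so the wildcard falls at the end. Function names are mine, and "cogitation" is just an illustrative term that happens to match co*tion:

```python
def permuterm_keys(term):
    """All rotations of term + '$' (the end-of-term marker)."""
    s = term + "$"
    return {s[i:] + s[:i] for i in range(len(s))}

def query_prefix(wildcard):
    """Rotate a single-'*' query X*Y so the '*' falls at the end: prefix Y$X."""
    pre, post = wildcard.split("*")
    return post + "$" + pre

prefix = query_prefix("co*tion")   # 'tion$co'
# A term matches iff one of its permuterm keys starts with the rotated prefix.
print(any(k.startswith(prefix) for k in permuterm_keys("cogitation")))  # True
print(any(k.startswith(prefix) for k in permuterm_keys("cotton")))      # False
```

In a real index the keys live in a B-tree, so "starts with prefix" is a range lookup rather than a scan.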
Spelling correction
Isolated word correction: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: Try all possible resulting phrases with one word "corrected" at a time, and suggest the one with most hits
54
Sec 332
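Edit distance 1 is exactly what separates the judgment/judgement pair from the hash-table slide. A standard dynamic-programming sketch with unit costs (weighted edit distance would replace the `+ 1`s with per-operation costs):

```python
def edit_distance(a, b):
    """Levenshtein distance via DP, keeping only the previous row."""
    prev = list(range(len(b) + 1))          # distance from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

print(edit_distance("judgment", "judgement"))  # 1
print(edit_distance("kitten", "sitting"))      # 3
```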
Week 5 Index construction
Sort-based indexing: Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks!)
Single-Pass In-Memory Indexing
No global dictionary - Generate separate dictionary for each block
Don't sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce
Dynamic indexing: Multiple indices, logarithmic merge
55
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID)
As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes
These are generated as we parse docs
Must now sort 100M such 8-byte records by termID
Define a Block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection
Basic idea of algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order
Sec 42
56
SPIMI Single-pass in-memory indexing
Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks
Key idea 2: Don't sort. Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
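Both SPIMI ideas in a toy in-memory sketch — a fresh dictionary per block, postings appended in arrival order, terms sorted only when the block is "written out" — with names and the list-of-blocks stand-in for disk of my own:

```python
from collections import defaultdict

def spimi_invert(token_stream, block_size):
    """SPIMI sketch: per-block dictionary (no global term->termID map);
    postings accumulate unsorted; terms are sorted only at block write."""
    blocks, index, seen = [], defaultdict(list), 0
    for term, doc_id in token_stream:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
        seen += 1
        if seen >= block_size:                  # "memory full": flush block
            blocks.append(sorted(index.items()))
            index, seen = defaultdict(list), 0
    if index:
        blocks.append(sorted(index.items()))
    return blocks

def merge_blocks(blocks):
    """Merge the separate per-block indices into one big index."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block:
            merged[term].extend(postings)
    return {term: sorted(set(p)) for term, p in merged.items()}

stream = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 3)]
print(merge_blocks(spimi_invert(stream, block_size=2)))
# {'brutus': [1, 3], 'caesar': [1, 2]}
```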
Distributed Indexing: MapReduce Data flow
58
[Figure: splits of the collection feed Parsers (map phase), assigned by the Master; parsers write segment files partitioned by term range (a-b, c-d, …, y-z); in the reduce phase, one Inverter per term range collects its segment files and writes the postings.]
Dynamic Indexing: 2nd simplest approach
Maintain "big" main index; new docs go into "small" (in-memory) auxiliary index
Search across both, merge results
Deletions: Invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector
Periodically re-index into one main index
Assuming T total # of postings and n as size of auxiliary index, we touch each posting up to floor(T/n) times
Alternatively, we can apply Logarithmic merge, in which each posting is merged up to log T times
Sec 45
59
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one
Keep smallest (Z0) in memory; larger ones (I0, I1, …) on disk
If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1
Either write merged Z1 to disk as I1 (if no I1), or merge with I1 to form Z2
… etc.
Sec 45
60
(Loop for log levels)
Week 6 Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws
Compression to make index smaller, faster
Dictionary compression for Boolean indexes: Dictionary string, blocks, front coding
Postings compression: Gap encoding
61
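Gap encoding in two small functions: store the first docID and then the differences, which are small numbers that variable-length codes compress well. A round-trip sketch with illustrative docIDs of my own:

```python
def to_gaps(postings):
    """Replace each docID (after the first) with its gap from the previous one."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Invert the encoding with a running sum."""
    postings, total = [], 0
    for g in gaps:
        total += g
        postings.append(total)
    return postings

docids = [283042, 283043, 283044, 283045]   # illustrative docIDs
print(to_gaps(docids))                       # [283042, 1, 1, 1]
assert from_gaps(to_gaps(docids)) == docids  # lossless round trip
```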
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below: k = 4
Need to store term lengths (1 extra byte)
62
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths
Sec 52
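A sketch of that blocked string layout: one length byte per term, and an offset kept only for every k-th term (k = 4), so a lookup scans at most k terms from the nearest stored offset. Function names are mine:

```python
def pack(terms, k=4):
    """Concatenate sorted terms with a 1-byte length prefix each,
    keeping an offset only for every k-th term."""
    blob, offsets = bytearray(), []
    for i, term in enumerate(terms):
        if i % k == 0:
            offsets.append(len(blob))
        blob.append(len(term))              # the 1 extra length byte
        blob.extend(term.encode("ascii"))
    return bytes(blob), offsets

def read_terms(blob, offset, k=4):
    """Decode up to k terms starting at a stored block offset."""
    terms, pos = [], offset
    while pos < len(blob) and len(terms) < k:
        n = blob[pos]
        terms.append(blob[pos + 1: pos + 1 + n].decode("ascii"))
        pos += 1 + n
    return terms

terms = ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite"]
blob, offsets = pack(terms)
print(read_terms(blob, offsets[0]))   # ['systile', 'syzygetic', 'syzygial', 'syzygy']
print(read_terms(blob, offsets[1]))   # ['szaibelyite']
```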
30
Deriving a sketch element: a permutation of the original hashes
[Figure: sketch of document 1 vs. sketch of document 2]
We use min_{s∈d1} π(s) = min_{s∈d2} π(s) as a test for "are d1 and d2 near-duplicates?" In this case, permutation π says d1 ≈ d2
Sec 196
Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
Let C00 = # of rows in A where both entries are 0; define C11, C10, C01 likewise
J(S_i, S_j) is then equivalent to C11 / (C10 + C01 + C11)
P(x_i = x_j) is then equivalent to C11 / (C10 + C01 + C11)
31
[Figure: a 0/1 matrix A with one column per set (S_i, S_j); here J(S_i, S_j) = 2/5. Below it, the rows of A reordered under permutations π(1), π(2), π(3), …, π(n=200); under each permutation we compare x_i and x_j (x_i ≠ x_j, x_i ≠ x_j, x_i = x_j, …, x_i = x_j).]
We view a matrix A: 1 column per set of hashes
Element A_ij = 1 if element i in set S_j is present
Permutation π(n) is a random reordering of the rows in A
x_k is the first non-zero entry in π(d_k), i.e. the first shingle present in document k
Sec 196
Estimating Jaccard
Thus, the proportion of successful permutations is the Jaccard coefficient
Permutation π is successful iff min_{s∈d1} π(s) = min_{s∈d2} π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success: proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus, to compute Jaccard, count the number k of successful permutations for <d1, d2> and divide by n = 200
k/n = k/200 estimates J(d1, d2)
32
Sec 196
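The estimator above can be simulated directly (and inefficiently) by really shuffling the shingle universe instead of using hash functions; the toy shingle sets and all names are my own:

```python
import random

def jaccard(s1, s2):
    """Exact Jaccard coefficient of two sets."""
    return len(s1 & s2) / len(s1 | s2)

def estimate_jaccard(s1, s2, n=200, seed=42):
    """Fraction of n random permutations of the shingle universe for
    which the minimum element of s1 equals the minimum element of s2."""
    universe = sorted(s1 | s2)
    rng = random.Random(seed)
    successes = 0
    for _ in range(n):
        rng.shuffle(universe)
        rank = {sh: r for r, sh in enumerate(universe)}  # the permutation pi
        if min(s1, key=rank.get) == min(s2, key=rank.get):
            successes += 1          # this permutation is "successful"
    return successes / n

d1 = {"a rose is", "rose is a", "is a rose"}   # toy 3-word shingle sets
d2 = {"a rose is", "rose is a", "is a lily"}
print(jaccard(d1, d2))             # 0.5
print(estimate_jaccard(d1, d2))    # close to 0.5 (a Bernoulli estimate, n = 200)
```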
Shingling: Summary
Input: N documents
Choose n-gram size for shingling, e.g. n = 5
Pick 200 random permutations, represented as hash functions
Compute N sketches: 200 × N matrix shown on previous slide, one row per permutation, one column per document
Compute pairwise similarities
Transitive closure of documents with similarity > θ
Index only one document from each equivalence class
33
Sec 196
Summary
Search Advertising: A type of crowdsourcing: ask advertisers how much they want to spend; ask searchers how relevant an ad is. Auction mechanics
Duplicate Detection: Represent documents as shingles; calculate an approximation to the Jaccard by using random trials
34
EXAM MATTERS
35
36
Subscribe via NUS EduRec System by 20 May 2019
Get Exam Results earlier via SMS on 3 Jun 2019
For more information:
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleasehtml
Exam Format
37
Date: 27 Apr (1pm)
Venue: SR3 (COM1-02-12)
Open book
Exam Topics
Anything covered in lecture (through slides) and corresponding sections in textbook
For sample questions, refer to tutorials, midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Proof that J(S(di)s(dj)) cong P(xi = xj)
Let C00 = of rows in A where both entries are 0 define C11 C10 C01 likewise
J(sisj) is then equivalent to C11 (C10 + C01 + C11) P(xi=xj) then is equivalent to C11 (C10 + C01 + C11)
31
0 1
1 0
1 1
0 0
1 1
0 1
0 1
0 0
1 1
1 0
1 1
0 1
1 1
0 0
0 1
1 1
1 0
0 1
1 0
0 1
0 1
1 1
1 1
0 0
0 0
1 1
0 1
1 1
0 1
1 0
hellip
A
J(sisj) = 25
Si Sj
π(1)Si Sj Si Sj Si Sj Si Sj
π(2) π(3) π(n=200)
We view a matrix A 1 column per set of hashes Element Aij = 1 if element i in
set Sj is present Permutation π(n) is a random
reordering of the rows in A xk is the first non-zero entry
in π(dk) ie first shingle present in document k
π π
π
xinexj xinexj xi=xjxi=xj
Sec 196
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Estimating Jaccard Thus the proportion of successful permutations is the
Jaccard coefficient Permutation π is successful iff minsisind1 π(s) = minsisind2 π(s)
Picking a permutation at random and outputting 1 (successful) or 0 (unsuccessful) is a Bernoulli trial
Estimator of probability of success proportion of successes in n Bernoulli trials (n = 200)
Our sketch is based on a random selection of permutations
Thus to compute Jaccard count the number k of successful permutations for lt d1 d2 gt and divide by n = 200
kn = k200 estimates J(d1 d2)32
Sec 196
CS3245 ndash Information Retrieval
Shingling Summary Input N documents Choose n-gram size for shingling eg n = 5 Pick 200 random permutations represented as hash
functions Compute N sketches 200 times N matrix shown on
previous slide one row per permutation one column per document
Compute pairwise similarities Transitive closure of documents with similarity gt θ Index only one document from each equivalence
class33
Sec 196
CS3245 ndash Information Retrieval
Summary Search Advertising A type of crowdsourcing Ask advertisers how much they
want to spend ask searchers how relevant an ad is Auction Mechanics
Duplicate Detection Represent documents as shingles Calculate an approximation to the Jaccard by using random
trials
34
CS3245 ndash Information Retrieval
EXAM MATTERS
35
Sec 196
CS3245 ndash Information Retrieval
36
Subscribe via NUS EduRec System by 20 May 2019
For more information
Get Exam Results earlier via SMS on 3 Jun 2019
httpsmyportalnusedusgstudentportalacademicsallscheduleofresultsreleaseht
ml
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
Unigram LM with Add-1 Smoothing
Not used in practice, but the most basic smoothing to understand. Idea: add 1 to the count of every entry in the LM, including those that are not seen.
e.g. assuming |V| = 11:
Q2 (by probability): "I don't want"
P(Aerosmith) = .11 × .11 × .11 = 1.3E-3
P(Lady Gaga) = .15 × .05 × .15 = 1.1E-3
Winner: Aerosmith
Aerosmith LM: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (Total Count: 18)
Lady Gaga LM: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (Total Count: 20)
|V| = total number of word types in both lines = 11
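Not from the slides: a minimal sketch of the comparison above, assuming the slide's rounded figures come from the raw counts (.11 = 2/18, .15 = 3/20); the function and variable names are mine, and `smooth=1` switches on add-1 smoothing.

```python
# Toy unigram LMs built from the slide's two song excerpts (counts as given;
# |V| = 11 word types across both lines).
aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}
ladygaga = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}

def prob(lm, words, smooth=0, vocab_size=11):
    """P(words) under a unigram LM; smooth=1 gives add-1 smoothing."""
    total = sum(lm.values()) + smooth * vocab_size
    p = 1.0
    for w in words:
        p *= (lm.get(w, 0) + smooth) / total
    return p

query = ["I", "don't", "want"]
print(prob(aerosmith, query))  # (2/18)^3, which the slide rounds via .11^3 to 1.3E-3
print(prob(ladygaga, query))   # (3/20)(1/20)(3/20) ~ 1.1E-3, so Aerosmith wins
```

With add-1 smoothing the denominators become 18+11 and 20+11, and Aerosmith still wins.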
Week 2: Basic (Boolean) IR
Basic inverted indexes: in-memory dictionary and on-disk postings. Key characteristic: sorted order for postings.
Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
Ch. 1
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings.
Document frequency information is also stored. Why frequency?
Sec. 1.2
The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.
[Figure: Brutus → 2 → 4 → 8 → 16 → 32 → 64 → 128; Caesar → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34; intersection → 2 → 8.]
If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings must be sorted by docID.
Sec. 1.3
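Not from the slides: the linear merge above can be sketched in a few lines (list representation and names are mine).

```python
def intersect(p1, p2):
    """Linear-time merge of two docID-sorted postings lists: O(x+y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```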
Week 3: Postings lists and Choosing terms
Faster merging of postings lists: skip pointers.
Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Ch. 2
Adding skip pointers to postings
Done at indexing time. Why? To skip postings that will not figure in the search results. How to do it? And where do we place skip pointers?
[Figure: postings list 2 → 4 → 8 → 41 → 48 → 64 → 128 with skip pointers to 41 and 128; postings list 1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 with skip pointers to 11 and 31.]
Sec. 2.3
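Not from the slides: one way to sketch intersection with skip pointers, assuming skips spaced every ~√L entries (a common placement heuristic); all names are mine.

```python
import math

def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, using skip pointers spaced
    every ~sqrt(len) entries to jump over runs that cannot match."""
    def interval(p):
        return max(1, int(math.sqrt(len(p))))
    s1, s2 = interval(p1), interval(p2)
    ans, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            ans.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # follow skip pointers while the skip target does not overshoot
            if i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                while i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                    i += s1
            else:
                i += 1
        else:
            if j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                while j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                    j += s2
            else:
                j += 1
    return ans

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```

Skipped-over entries are always strictly smaller than the skip target, so no match can be jumped past.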
Biword indexes
Index every consecutive pair of terms in the text as a phrase: a bigram model using words. For example, the text "Friends Romans Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
Sec. 2.4.1
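Not from the slides: a minimal sketch of building a biword index (document texts and function names are mine).

```python
from collections import defaultdict

def build_biword_index(docs):
    """docs: dict docID -> text. Returns biword -> sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for a, b in zip(tokens, tokens[1:]):  # every consecutive pair
            index[f"{a} {b}"].add(doc_id)
    return {bw: sorted(ids) for bw, ids in index.items()}

docs = {1: "Friends Romans Countrymen",
        2: "Romans Countrymen lend me your ears"}
idx = build_biword_index(docs)
print(idx["friends romans"])     # [1]
print(idx["romans countrymen"])  # [1, 2]
```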
Positional index example
For phrase queries, we use a merge algorithm recursively, at the document level. Now we need to deal with more than just equality.
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec. 2.4.2
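Not from the slides: the positional check for the bigram "to be" can be sketched as below. The "be" postings are taken from the slide; the positions for "to" are hypothetical, invented so that only doc 4 matches.

```python
def phrase_docs(p_first, p_second):
    """Docs where some position of the second word immediately follows
    a position of the first word (i.e. the two-word phrase occurs)."""
    hits = []
    for doc, first_pos in p_first.items():
        second_pos = set(p_second.get(doc, []))
        if any(p + 1 in second_pos for p in first_pos):
            hits.append(doc)
    return sorted(hits)

# 'be' postings as on the slide; 'to' positions are made up for illustration
p_be = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
        4: [17, 191, 291, 430, 434], 5: [363, 367]}
p_to = {1: [2, 50], 2: [1], 4: [16, 429, 433], 5: [100]}
print(phrase_docs(p_to, p_be))  # [4]
```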
Choosing terms
Definitely language specific. In English, we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hash, trees.
Learning to be tolerant:
1. Wildcards: general trees, Permuterm, n-grams redux
2. Spelling correction: edit distance, n-grams re-redux
3. Phonetic: Soundex
Hash Table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we need to occasionally do the expensive operation of rehashing everything. Not very tolerant!
Sec. 3.1
Wildcard queries
How can we handle *'s in the middle of a query term, e.g. co*tion? We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
Sec. 3.2
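Not from the slides: a short sketch of the Permuterm rotation trick (function names are mine; the index keys are all rotations of term+"$").

```python
def permuterm_rotations(term):
    """All rotations of term + '$'; each rotation is a dictionary key."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def wildcard_key(query):
    """Rotate a single-'*' wildcard query so the '*' lands at the end;
    what remains (minus the '*') is a prefix key into the permuterm index."""
    q = query + "$"
    star = q.index("*")
    return q[star + 1:] + q[:star]  # move everything after '*' to the front

key = wildcard_key("co*tion")  # 'tion$co'
matches = any(r.startswith(key) for r in permuterm_rotations("constitution"))
print(key, matches)  # tion$co True
```

So co*tion becomes the prefix lookup tion$co*, which matches the rotation tion$constitu of "constitution".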
Spelling correction
Isolated-word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
Sec. 3.3.2
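Not from the slides: the classic Levenshtein dynamic program, for reference (names are mine).

```python
def edit_distance(a, b):
    """Levenshtein DP: d[i][j] = edits to turn a[:i] into b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[-1][-1]

print(edit_distance("judgement", "judgment"))  # 1
```

A weighted variant would replace the constant costs with per-character confusion weights.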
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing (BSBI). Merge sort is effective for disk-based sorting (avoids seeks).
Single-Pass In-Memory Indexing (SPIMI): no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
Distributed indexing using MapReduce. Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked Sort-Based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs.
Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.
Sec. 4.2
SPIMI: Single-Pass In-Memory Indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
Sec. 4.3
Distributed Indexing: MapReduce data flow
[Figure: input splits are assigned by a Master to Parsers (map phase); Parsers write segment files partitioned by term range (a-b, c-d, …, y-z); Inverters (reduce phase) each collect one term-range partition and write out the postings.]
Dynamic Indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results.
Deletions: invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector.
Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec. 4.5
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge with I0 (if I0 already exists) as Z1.
Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2 … etc. Loop for log levels.
Sec. 4.5
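Not from the slides: a toy sketch of the logarithmic-merge cascade, with sorted Python lists standing in for indexes and a dict standing in for disk (class and attribute names are mine).

```python
class LogMergeIndex:
    """Toy logarithmic merge: level k on 'disk' holds ~n * 2^k postings."""

    def __init__(self, n):
        self.n = n      # capacity of the in-memory index Z0
        self.z0 = []    # in-memory index
        self.disk = {}  # level -> on-disk index (simulated)

    def add(self, posting):
        self.z0.append(posting)
        if len(self.z0) > self.n:
            self._flush()

    def _flush(self):
        z, level = sorted(self.z0), 0
        # cascade: merge with I_level while it exists, doubling each time
        while level in self.disk:
            z = sorted(z + self.disk.pop(level))
            level += 1
        self.disk[level] = z
        self.z0 = []

idx = LogMergeIndex(n=2)
for doc in range(10):
    idx.add(doc)
print(idx.disk, idx.z0)
```

Each flush merges a posting at most once per level, so a posting is merged O(log T) times overall.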
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster. Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding. Postings compression: gap encoding.
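Not from the slides: gap encoding in two lines each way (names are mine). Sorted docIDs become small gaps, which a variable-length code can then store compactly.

```python
def to_gaps(postings):
    """Encode sorted docIDs as gaps from the previous docID (first kept as-is)."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Decode gaps back into absolute docIDs by running-sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

docs = [2, 4, 8, 41, 48, 64, 128]
print(to_gaps(docs))  # [2, 2, 4, 33, 7, 16, 64]
```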
Index Compression: Dictionary-as-a-String and Blocking
Store pointers to every kth term string. Example below with k = 4. Need to store term lengths (1 extra byte each).
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec. 5.2
Shingling Summary
Input: N documents. Choose the n-gram size for shingling, e.g. n = 5. Pick 200 random permutations, represented as hash functions.
Compute N sketches: a 200 × N matrix (shown on a previous slide), one row per permutation, one column per document.
Compute pairwise similarities; take the transitive closure of documents with similarity > θ; index only one document from each equivalence class.
Sec. 19.6
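Not from the slides: a small sketch of the pipeline above, assuming word 5-gram shingles and CRC32 XORed with random masks as a stand-in for the 200 random permutations; documents, names, and hash choice are all mine.

```python
import random
import zlib

def shingles(text, n=5):
    """Word n-gram shingles of a document (n = 5 as on the slide)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def sketch(sh, hashes):
    """MinHash sketch: for each hash (simulating a random permutation),
    keep the minimum hash value over the document's shingles."""
    return [min(h(s) for s in sh) for h in hashes]

def est_jaccard(s1, s2):
    """Fraction of agreeing sketch rows estimates the Jaccard coefficient."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

random.seed(0)
masks = [random.getrandbits(32) for _ in range(200)]
hashes = [lambda s, m=m: zlib.crc32(s.encode()) ^ m for m in masks]

d1 = "a rose is a rose is a rose " * 4
d2 = "a rose is a rose is not a tulip " * 4
s1, s2 = shingles(d1), shingles(d2)
true_j = len(s1 & s2) / len(s1 | s2)
est = est_jaccard(sketch(s1, hashes), sketch(s2, hashes))
print(round(true_j, 2), round(est, 2))
```

With 200 random trials the estimate concentrates near the true Jaccard; pairs above the θ threshold are then clustered and deduplicated.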
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 – Information Retrieval

Get exam results earlier via SMS on 3 Jun 2019: subscribe via the NUS EduRec system by 20 May 2019.
For more information: https://myportal.nus.edu.sg/studentportal/academics/all/scheduleofresultsrelease.html
Exam Format
- Date: 27 Apr (1pm)
- Venue: SR3 (COM1-02-12)
- Open book
Exam Topics
- Anything covered in lecture (through the slides) and the corresponding sections in the textbook.
- For sample questions, refer to the tutorials, the midterm, and past-year exam papers.
- If in doubt, ask on the forum.
Exam Topic Distribution
- Focus is on topics covered from Weeks 5 to 12.
- Q1 is true/false questions on various topics (20 marks).
- Q2–Q9 are calculation/essay questions, each topic specific (8–12 marks each). Don't forget your calculator!
REVISION
Week 1: Ngram Models
- A language model is built from a collection of text and used to assign a probability to a sequence of words.
- Unigram LM: bag of words.
- Ngram LM: use the previous n−1 tokens as the context to predict the nth token.
- Larger n-gram models are more accurate, but each increase in order requires exponentially more space.
- LMs can also be used to do information retrieval, as introduced in Week 11.
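To make the summary above concrete, here is a minimal maximum-likelihood bigram model (a sketch, not from the lecture; the training line is an invented example):

```python
from collections import defaultdict

def train_bigram_lm(tokens):
    """MLE bigram model: P(w2 | w1) = count(w1 w2) / count(w1 as a bigram start)."""
    unigram = defaultdict(int)   # counts of each token as a bigram start
    bigram = defaultdict(int)
    for w1, w2 in zip(tokens, tokens[1:]):
        unigram[w1] += 1
        bigram[(w1, w2)] += 1
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

p = train_bigram_lm("i want to close my eyes i want to sleep".split())
print(p("want", "to"))   # 1.0 -- "want" is always followed by "to" here
print(p("to", "close"))  # 0.5 -- "to" is followed by "close" once out of twice
```

A trigram model would condition on the previous two tokens instead, which is why space grows exponentially with the order.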
Unigram LM with Add-1 smoothing
- Not used in practice, but the most basic smoothing to understand.
- Idea: add 1 to the count of every entry in the LM, including those that are not seen.

E.g., assuming |V| = 11 (the total number of word types in both lines):

Aerosmith: I 2, don't 2, want 2, to 2, close 2, my 2, eyes 2, your 1, love 1, and 1, revenge 1 (total count 18)
Lady Gaga: I 3, don't 1, want 3, to 1, close 1, my 1, eyes 1, your 3, love 2, and 2, revenge 2 (total count 20)

Q2 (by probability): "I don't want"
P(Aerosmith) = 3/29 × 3/29 × 3/29 ≈ 1.1E-3
P(Lady Gaga) = 4/31 × 2/31 × 4/31 ≈ 1.1E-3
Winner: Aerosmith (1.107E-3 vs 1.074E-3)
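The quiz arithmetic above can be checked mechanically (a sketch; the counts are taken from the tables on the slide):

```python
def add_one_prob(counts, vocab_size, query):
    """Add-1 smoothed unigram probability of a token sequence:
    P(w) = (count(w) + 1) / (total + |V|), multiplied over the query."""
    total = sum(counts.values())
    p = 1.0
    for w in query:
        p *= (counts.get(w, 0) + 1) / (total + vocab_size)
    return p

aerosmith = {"I": 2, "don't": 2, "want": 2, "to": 2, "close": 2, "my": 2,
             "eyes": 2, "your": 1, "love": 1, "and": 1, "revenge": 1}  # total 18
ladygaga = {"I": 3, "don't": 1, "want": 3, "to": 1, "close": 1, "my": 1,
            "eyes": 1, "your": 3, "love": 2, "and": 2, "revenge": 2}   # total 20
query = ["I", "don't", "want"]
print(add_one_prob(aerosmith, 11, query))  # ~1.107e-03
print(add_one_prob(ladygaga, 11, query))   # ~1.074e-03 -> Aerosmith wins
```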
Week 2: Basic (Boolean) IR (Ch 1)
- Basic inverted indexes: in-memory dictionary and on-disk postings.
- Key characteristic: sorted order for postings.
- Boolean query processing: intersection by linear-time "merging"; simple optimizations by expected size.
Indexer steps: Dictionary & Postings (Sec 1.2)
- Multiple entries of a term in a single document are merged.
- Split into Dictionary and Postings.
- Document frequency information is also stored. (Why frequency?)
The merge (Sec 1.3)
- Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Intersection: 2 → 8

- If the list lengths are x and y, the merge takes O(x+y) operations.
- Crucial: postings must be sorted by docID.
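The linear-time merge reads, in Python (a sketch using the Brutus/Caesar lists from the slide):

```python
def intersect(p1, p2):
    """Linear-time merge of two docID-sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1   # advance the pointer with the smaller docID
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]
```

Each iteration advances at least one pointer, so the total work is O(x+y); this is exactly why the postings must be sorted by docID.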
Week 3: Postings lists and Choosing terms (Ch 2)
- Faster merging of postings lists: skip pointers.
- Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
- Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
Adding skip pointers to postings (Sec 2.3)
- Done at indexing time.
- Why? To skip postings that will not figure in the search results.
- How to do it? And where do we place the skip pointers?

Example lists with skip pointers:
2 → 4 → 8 → 41 → 48 → 64 → 128 (skips: 2→41, 41→128)
1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 (skips: 1→11, 11→31)
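A sketch of the merge with skip pointers, assuming the common heuristic of a skip every sqrt(L) positions (the skips are implicit in the indices rather than stored, which is an illustrative simplification):

```python
import math

def intersect_with_skips(p1, p2):
    """Intersection of sorted postings with skips placed every sqrt(L) positions."""
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take the skip only if its target does not overshoot the other list
            if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                i += skip1
            else:
                i += 1
        else:
            if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                j += skip2
            else:
                j += 1
    return answer

print(intersect_with_skips([2, 4, 8, 41, 48, 64, 128],
                           [1, 2, 3, 8, 11, 17, 21, 31]))  # [2, 8]
```

Because a skip is taken only when its target is still ≤ the other pointer's docID, no potential match is ever jumped over.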
Biword indexes (Sec 2.4.1)
- Index every consecutive pair of terms in the text as a phrase: a bigram model using words.
- For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
- Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
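Biword generation is a one-liner over the token stream (a sketch; tokenization here is just case folding and comma stripping):

```python
def biwords(text):
    """Generate biword dictionary terms from tokenized, case-folded text."""
    tokens = text.lower().replace(",", "").split()
    return [f"{w1} {w2}" for w1, w2 in zip(tokens, tokens[1:])]

print(biwords("Friends, Romans, Countrymen"))  # ['friends romans', 'romans countrymen']
```

Longer phrase queries are handled by ANDing their biwords, which can produce false positives and must be postfiltered.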
Positional index example (Sec 2.4.2)
- For phrase queries, we use a merge algorithm recursively, now at the document level.
- Now we need to deal with more than just equality.

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
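The recursive merge for a two-word phrase can be sketched as follows. The positional data here is a hypothetical sample (the slide only gives the "be" postings, so the "to" positions below are invented for illustration):

```python
def phrase_docs(index, w1, w2):
    """Docs where w2 occurs at the position immediately after w1."""
    hits = []
    # outer merge: docs containing both terms
    for doc in sorted(index[w1].keys() & index[w2].keys()):
        # inner merge: check adjacent positions within the doc
        pos2 = set(index[w2][doc])
        if any(p + 1 in pos2 for p in index[w1][doc]):
            hits.append(doc)
    return hits

index = {  # term -> {docID: sorted positions}; sample data
    "to": {1: [4, 17], 4: [16, 190]},
    "be": {1: [7, 18, 33], 2: [3, 149], 4: [17, 191, 291]},
}
print(phrase_docs(index, "to", "be"))  # [1, 4]
```

Proximity queries generalize the inner check from "position + 1" to "position within k".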
Choosing terms
- Definitely language specific.
- In English we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming.
Week 4: The dictionary and tolerant retrieval
- Data structures for the dictionary: hash, trees.
- Learning to be tolerant:
  1. Wildcards: general trees, Permuterm, ngrams (redux)
  2. Spelling correction: edit distance, ngrams (re-redux)
  3. Phonetic: Soundex
Hash table (Sec 3.1)
- Each vocabulary term is hashed to an integer.
- Pros: lookup is faster than for a tree: O(1).
- Cons (not very tolerant):
  - No easy way to find minor variants: judgment/judgement.
  - No prefix search.
  - If the vocabulary keeps growing, we occasionally need to do the expensive operation of rehashing everything.
Wildcard queries (Sec 3.2)
- How can we handle *'s in the middle of a query term, e.g., co*tion?
- We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
- The solution: transform wildcard queries so that the * always occurs at the end.
- This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
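The Permuterm index stores every rotation of term$ as a key pointing back to the term; a sketch of the rotation step:

```python
def permuterm_rotations(term):
    """All rotations of term$ -- each becomes an index key mapping back to term."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

# To answer co*tion: rotate the query so * is at the end (co*tion -> tion$co*),
# then prefix-search the permuterm keys for "tion$co".
print(permuterm_rotations("hello"))
# ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
```

The price is a dictionary roughly |term|+1 times larger, which is why the k-gram index with postfiltering is the space-saving alternative.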
Spelling correction (Sec 3.3.2)
- Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.
- How do we define "closest"? We studied several alternatives:
  1. Edit distance (Levenshtein distance)
  2. Weighted edit distance
  3. ngram overlap
- Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
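The Levenshtein distance is computed with the standard dynamic-programming table; a row-by-row sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance: min insertions/deletions/substitutions to turn a into b."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete ca
                            curr[j - 1] + 1,          # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]

print(edit_distance("judgment", "judgement"))  # 1 (insert 'e')
print(edit_distance("kitten", "sitting"))      # 3
```

Weighted edit distance simply replaces the three "+1" costs with per-character weights.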
Week 5: Index construction
- Sort-based indexing. Blocked Sort-Based Indexing: merge sort is effective for disk-based sorting (avoids seeks).
- Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
- Distributed indexing using MapReduce.
- Dynamic indexing: multiple indices, logarithmic merge.
BSBI: Blocked Sort-Based Indexing, i.e., sorting with fewer disk seeks (Sec 4.2)
- 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs; these are generated as we parse docs.
- Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
- Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.
SPIMI: Single-Pass In-Memory Indexing (Sec 4.3)
- Key idea 1: generate separate dictionaries for each block; no need to maintain a term–termID mapping across blocks.
- Key idea 2: don't sort; accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block; these separate indices can then be merged into one big index.
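The accumulate-don't-sort idea for a single SPIMI block, as a sketch (terms are sorted only once, when the block is written out):

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build one SPIMI block: append docIDs to per-term postings as tokens
    arrive; sort the terms only when the block is finalized."""
    dictionary = defaultdict(list)
    for term, doc_id in token_stream:
        postings = dictionary[term]
        if not postings or postings[-1] != doc_id:  # docIDs arrive in order
            postings.append(doc_id)
    return {term: dictionary[term] for term in sorted(dictionary)}

stream = [("caesar", 1), ("brutus", 1), ("caesar", 2), ("brutus", 1)]
print(spimi_invert(stream))  # {'brutus': [1], 'caesar': [1, 2]}
```

Because each block has its own dictionary, there is no global term–termID table to keep in memory, which is the difference from BSBI.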
Distributed Indexing: MapReduce data flow

[Figure: in the map phase, parsers (assigned work by the master) read segment files and emit term–docID pairs partitioned into term ranges (a-b, c-d, …, y-z); in the reduce phase, inverters collect each partition and write out its postings.]
Dynamic indexing: 2nd simplest approach (Sec 4.5)
- Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
- Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs in a search result by this bit-vector.
- Periodically re-index into one main index. Assuming T total postings and an auxiliary index of size n, we touch each posting up to floor(T/n) times.
- Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Logarithmic merge (Sec 4.5)
- Idea: maintain a series of indexes, each twice as large as the previous one.
- Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
- If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
- Either write the merged Z1 to disk as I1 (if there is no I1), or merge it with I1 to form Z2; and so on, looping for log levels.
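The cascade of merges can be sketched with plain sorted lists standing in for the on-disk indexes I0, I1, …; `log_merge_add` is a hypothetical helper name, not from the slides:

```python
def log_merge_add(levels, z0, n):
    """Logarithmic-merge sketch. levels[i] holds index I_i (or None if empty).
    When the in-memory index z0 exceeds n postings, cascade-merge it upward:
    merge with I_0, I_1, ... until an empty level is found, doubling each time."""
    if len(z0) <= n:
        return levels, z0                    # still fits in memory
    carry = sorted(z0)                       # flush Z0
    i = 0
    while i < len(levels) and levels[i] is not None:
        carry = sorted(levels[i] + carry)    # merge with I_i to form Z_{i+1}
        levels[i] = None
        i += 1
    if i == len(levels):
        levels.append(None)
    levels[i] = carry                        # write carry to disk as I_i
    return levels, []

levels, z0 = log_merge_add([], [5, 1, 9], 2)
print(levels, z0)  # [[1, 5, 9]] []
```

Each posting moves up one level per merge, so it is merged O(log) times rather than the floor(T/n) times of the simple auxiliary-index scheme.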
Week 6: Index Compression
- Collection and vocabulary statistics: Heaps' and Zipf's laws.
- Compression to make the index smaller and faster:
  - Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding.
  - Postings compression: gap encoding.
Index Compression: Dictionary-as-a-String and Blocking (Sec 5.2)
- Store pointers to every kth term string (example below: k = 4).
- Need to store term lengths (1 extra byte each):

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

- Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
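The blocked string layout above can be sketched as follows. As a simplification, the length prefix is written as a decimal string here for readability; a real index stores it in one byte:

```python
def block_encode(terms, k=4):
    """Pack terms into a single string as <len><term>..., keeping one
    pointer per block of k terms instead of one pointer per term."""
    string, pointers = "", []
    for i, term in enumerate(terms):
        if i % k == 0:
            pointers.append(len(string))  # pointer only at block boundaries
        string += str(len(term)) + term   # 1-"byte" length prefix, then term
    return string, pointers

s, ptrs = block_encode(["systile", "syzygetic", "syzygial", "syzygy"])
print(s)     # 7systile9syzygetic8syzygial6syzygy
print(ptrs)  # [0]
```

With k = 4 we drop 3 of every 4 pointers (saving 3 bytes each in the slide's accounting) at the cost of 4 one-byte length prefixes, matching the "save 9, lose 4" trade-off.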
CS3245 ndash Information Retrieval
Exam Format
37
Date 27 Apr (1pm)
Venue SR3 (COM1-02-12)
Open book
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Exam Topics Anything covered in lecture (through slides) and
corresponding sections in textbook
For sample questions refer to tutorials midterm and past year exam papers
If in doubt ask on the forum
38
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Exam Topic Distribution Focus on topics covered from Weeks 5 to 12
Q1 is truefalse questions on various topics (20 marks)
Q2-Q9 are calculation essay questions topic specific (8~12 marks each) Donrsquot forget your calculator
39
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
Week 3: Postings lists and Choosing terms
- Faster merging of postings lists: skip pointers.
- Handling of phrase and proximity queries: biword indexes for phrase queries; positional indexes for phrase/proximity queries.
- Steps in choosing terms for the dictionary: text extraction, granularity of indexing, tokenization, stop word removal, normalization, lemmatization and stemming.
46
Ch 2
Adding skip pointers to postings
- Done at indexing time. Why? To skip postings that will not figure in the search results.
- How to do it? And where do we place skip pointers?

With skips: 2 → 4 → 8 → 41 → 48 → 64 → 128 (skip pointers to 41 and 128)
            1 → 2 → 3 → 8 → 11 → 17 → 21 → 31 (skip pointers to 11 and 31)
47
Sec 2.3
Biword indexes
- Index every consecutive pair of terms in the text as a phrase (a bigram model using words).
- For example, the text "Friends, Romans, Countrymen" would generate the biwords: "friends romans", "romans countrymen".
- Each of these biwords is now a dictionary term. Two-word phrase query processing is now immediate.
48
Sec 2.4.1
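Generating the biword terms is a one-liner over adjacent token pairs (a sketch; the punctuation handling here is my simplification):

```python
def biwords(text):
    """Generate the biword dictionary terms for a text."""
    tokens = text.lower().replace(",", "").split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(biwords("Friends, Romans, Countrymen"))  # ['friends romans', 'romans countrymen']
```

Note the cost: the biword vocabulary is much larger than the unigram one, and queries longer than two words still need the biwords chained together with a postfilter.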
Positional index example
- For phrase queries, we use a merge algorithm recursively, at the document level.
- Now need to deal with more than just equality.

<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>

Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
49
Sec 2.4.2
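The document-level-then-position-level merge for a two-word phrase can be sketched as follows (the positional postings for "to" below are invented for illustration; only the shape of the algorithm is from the lecture):

```python
def phrase_docs(pos1, pos2):
    """Docs where term2 occurs at position p+1 for some position p of term1,
    i.e. the phrase 'term1 term2' appears."""
    hits = []
    for doc in pos1.keys() & pos2.keys():      # document-level merge
        p2 = set(pos2[doc])
        if any(p + 1 in p2 for p in pos1[doc]):  # position-level check
            hits.append(doc)
    return sorted(hits)

# hypothetical positional postings, keyed by docID
to = {1: [4], 4: [16, 429, 433]}
be = {1: [7, 18], 4: [17, 430, 434]}
print(phrase_docs(to, be))  # [4] — only doc 4 has "to" immediately before "be"
```

Proximity queries generalize the `p + 1` test to `|p - q| <= k`, which is why the merge must "deal with more than just equality."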
Choosing terms
- Definitely language specific!
- In English we worry about: tokenization, stop word removal, normalization (e.g., case folding), lemmatization and stemming.
50
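Those steps compose into a small pipeline; here is a toy version (the stop list and the crude suffix-stripping are my stand-ins for a real stop list and a real stemmer such as Porter's):

```python
STOP = {"the", "a", "an", "of", "to", "and"}

def to_terms(text):
    """Toy term pipeline: tokenize, case-fold, drop stop words, crude stemming."""
    terms = []
    for tok in text.lower().split():
        tok = tok.strip(".,!?")
        if tok in STOP:
            continue
        for suf in ("ing", "es", "s"):   # crude stand-in for a real stemmer
            if tok.endswith(suf) and len(tok) > len(suf) + 2:
                tok = tok[: -len(suf)]
                break
        terms.append(tok)
    return terms

print(to_terms("The Romans were marching to Rome."))
```

The same pipeline must be applied to queries at search time, or query terms will not match dictionary terms.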
Week 4: The dictionary and tolerant retrieval
- Data structures for the dictionary: hash, trees.
- Learning to be tolerant:
  1. Wildcards: general trees, Permuterm, Ngrams redux
  2. Spelling correction: edit distance, Ngrams re-redux
  3. Phonetic – Soundex
51
Hash Table
- Each vocabulary term is hashed to an integer.
- Pros: lookup is faster than for a tree: O(1).
- Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything. Not very tolerant!
52
Sec 3.1
Wildcard queries
- How can we handle *'s in the middle of a query term? E.g., co*tion
- We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
- The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index.
- Alternative: K-gram index with postfiltering.
53
Sec 3.2
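The Permuterm rotation can be sketched in two small functions (single-* queries only; the example term "coation" is made up to match the pattern):

```python
def permuterm_keys(term):
    """All rotations of term + '$': the permuterm keys that point back to term."""
    t = term + "$"
    return {t[i:] + t[:i] for i in range(len(t))}

def wildcard_prefix(query):
    """Rotate a single-* query X*Y into Y$X, so the wildcard lands at the end;
    then the B-tree is searched for keys with this prefix."""
    x, y = query.split("*")
    return y + "$" + x

keys = permuterm_keys("coation")
prefix = wildcard_prefix("co*tion")          # 'tion$co'
print(any(k.startswith(prefix) for k in keys))  # True: 'coation' matches co*tion
```

The cost is the index blow-up: a term of length L contributes L+1 rotated keys, which is why the K-gram alternative is attractive.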
Spelling correction
- Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
  1. Edit distance (Levenshtein distance)
  2. Weighted edit distance
  3. Ngram overlap
- Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
54
Sec 3.3.2
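Levenshtein distance is the standard dynamic-programming table, kept here one row at a time (a compact sketch; the full table version also recovers the edit sequence):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution (0 if match)
        prev = cur
    return prev[-1]

print(edit_distance("judgment", "judgement"))  # 1 — the minor variant from the hash-table slide
```

Weighted edit distance just replaces the three `+ 1` costs with operation- or character-specific weights.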
Week 5: Index construction
- Sort-based indexing: Blocked Sort-Based Indexing. Merge sort is effective for disk-based sorting (avoid seeks!).
- Single-Pass In-Memory Indexing: no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
- Distributed indexing using MapReduce.
- Dynamic indexing: multiple indices, logarithmic merge.
55
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
- 8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to termIDs of 4 bytes. These are generated as we parse docs.
- Must now sort 100M such 8-byte records by termID. Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection.
- Basic idea of the algorithm: accumulate postings for each block, sort, write to disk; then merge the blocks into one long sorted order.
Sec 4.2
56
SPIMI: Single-pass in-memory indexing
- Key idea 1: generate separate dictionaries for each block – no need to maintain the term-termID mapping across blocks.
- Key idea 2: don't sort; accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block; these separate indices can then be merged into one big index.
Sec 4.3
57
Distributed Indexing: MapReduce Data flow
(Figure: a Master assigns segment files to Parsers, which in the map phase emit term-partition files a-b, c-d, …, y-z; in the reduce phase, Inverters turn each partition into postings.)
58
Dynamic Indexing: 2nd simplest approach
- Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both, merge results.
- Deletions: invalidation bit-vector for deleted docs; filter docs output on a search result by this invalidation bit-vector.
- Periodically re-index into one main index. Assuming T total # of postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times.
- Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times.
Sec 4.5
59
Logarithmic merge
- Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
- If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1.
- Either write merged Z1 to disk as I1 (if no I1), or merge with I1 to form Z2 … etc. (Loop for log levels.)
Sec 4.5
60
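The cascade can be sketched as a carry-propagation loop (my own simplification: each level holds one index or None, set-union stands in for postings merging, and the size bound n is implicit in when the caller flushes Z0):

```python
def log_merge_add(levels, z0):
    """Flush the in-memory index z0 into the on-disk level list.
    levels[i] holds the index I_i (size ~ n * 2**i) or None if that slot is free.
    Merging cascades upward, like binary addition with carries."""
    carry = sorted(z0)
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:
            levels[i] = carry                    # free slot: write and stop
            return
        carry = sorted(set(levels[i]) | set(carry))  # merge and promote
        levels[i] = None
        i += 1

levels = []
log_merge_add(levels, [1, 2])   # becomes I0
log_merge_add(levels, [3, 4])   # I0 occupied: merge up to I1
print(levels)  # [None, [1, 2, 3, 4]]
```

Each posting rises at most once per level, and there are only O(log T) levels, which is where the "each posting is merged up to log T times" bound comes from.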
Week 6: Index Compression
- Collection and vocabulary statistics: Heaps' and Zipf's laws.
- Compression to make the index smaller and faster:
  Dictionary compression for Boolean indexes: dictionary string, blocks, front coding.
  Postings compression: gap encoding.
61
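Gap encoding, the postings-compression idea named above, stores each docID as a delta from the previous one (a sketch; real indexes then variable-byte or gamma-encode the small gaps):

```python
def to_gaps(postings):
    """Gap-encode a sorted docID list: first docID, then successive deltas."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Decode gaps back to absolute docIDs via a running sum."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

docs = [283042, 283043, 283044, 283047]
print(to_gaps(docs))  # [283042, 1, 1, 3] — frequent terms yield tiny gaps
```

The win is largest for frequent terms: their postings are dense, so most gaps are small integers that fit in very few bits.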
Index Compression: Dictionary-as-a-String and Blocking
- Store pointers to every kth term string. Example below: k = 4.
- Need to store term lengths (1 extra byte each).

…7systile9syzygetic8syzygial6syzygy11szaibelyite…

- Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
Sec 5.2
62
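The blocked string above can be decoded with a small sketch (length-prefixed terms, matching the slide's layout; the parsing helper is mine):

```python
def read_block(s):
    """Split a block of length-prefixed terms like '7systile9syzygetic...'."""
    terms, i = [], 0
    while i < len(s):
        j = i
        while s[j].isdigit():
            j += 1                      # read the (possibly multi-digit) length
        length = int(s[i:j])
        terms.append(s[j:j + length])
        i = j + length
    return terms

print(read_block("7systile9syzygetic8syzygial6syzygy"))
```

The arithmetic behind the slide's note: with k = 4, each block drops 3 term pointers (9 bytes at 3 bytes each) but spends 4 bytes on the length prefixes, a net saving of 5 bytes per block; lookup inside a block becomes a short linear scan.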
CS3245 ndash Information Retrieval
40
REVISION
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Week 1 Ngram Models A language model is created based on a collection of text and used to assign a probability to a sequence of words
Unigram LM Bag of words Ngram LM use n-1 tokens as the context to predict
the nth token Larger n-gram models more accurate but each increase in
order requires exponentially more space
LMs can be used to do information retrieval as introduced in Week 11
41
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Unigram LM with Add 1 smoothing Not used in practice but most basic to understand Idea add 1 count to all entries in the LM including
those that are not seen
eg assuming |V| = 11
Q2 (By Probability) ldquoI donrsquot wantrdquoP(Aerosmith) 11 11 11 = 13E-3P(LadyGaga) 15 05 15 = 11E-3Winner Aerosmith
42
I 2 eyes 2
donrsquot 2 your 1
want 2 love 1
to 2 and 1
close 2 revenge 1
my 2 Total Count 18
I 3 eyes 1
donrsquot 1 your 3
want 3 love 2
to 1 and 2
close 1 revenge 2
my 1 Total Count 20
Total of word types in both
lines
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Week 2 Basic (Boolean) IR Basic inverted indexes In memory dictionary and on disk postings
Key characteristic Sorted order for postings
Boolean query processing Intersection by linear time ldquomergingrdquo
Simple optimizations by expected size
43
Ch 1
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
Positional index example: for phrase queries, we use a merge algorithm recursively, at the document level
Now need to deal with more than just equality
<be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367; …>
Quick check: which of docs 1, 2, 4, 5 could contain "to be or not to be"?
Sec 2.4.2
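The document-level positional check for a two-word phrase can be sketched as below; the postings for "be" are taken from the slide, while the postings for "to" are invented purely for illustration:

```python
# Positional postings: docID -> sorted positions.
be = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149],
      4: [17, 191, 291, 430, 434], 5: [363, 367]}
to = {4: [16, 190, 429], 5: [362, 366]}  # made-up postings for "to"

def phrase_docs(p_first, p_second):
    """Docs in which some occurrence of the second term immediately
    follows an occurrence of the first term."""
    hits = []
    for doc in sorted(p_first.keys() & p_second.keys()):
        first_positions = set(p_first[doc])
        # "More than just equality": compare positions offset by one
        if any(pos - 1 in first_positions for pos in p_second[doc]):
            hits.append(doc)
    return hits

print(phrase_docs(to, be))  # [4, 5]
```

The outer loop is the usual docID merge; the inner check is the extra positional comparison that plain Boolean intersection does not need.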
Choosing terms: definitely language-specific
In English we worry about: tokenization, stop word removal, normalization (e.g. case folding), lemmatization and stemming
Week 4: The dictionary and tolerant retrieval
Data structures for the dictionary: hashes, trees
Learning to be tolerant:
1. Wildcards: general, trees, Permuterm, n-grams redux
2. Spelling correction: edit distance, n-grams re-redux
3. Phonetic: Soundex
Hash table: each vocabulary term is hashed to an integer
Pros: lookup is faster than for a tree: O(1)
Cons: no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything
Not very tolerant
Sec 3.1
Wildcard queries: how can we handle *'s in the middle of a query term, e.g. co*tion?
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end
This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering
Sec 3.2
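The Permuterm idea can be sketched as follows: index every rotation of term$, then rotate a single-* query so the wildcard falls at the end and do a prefix range lookup. The tiny vocabulary is made up for illustration:

```python
import bisect

def permuterm_rotations(term):
    """All rotations of term + '$' (end marker)."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    """Sorted list of (rotation, original_term) pairs."""
    return sorted((rot, term) for term in vocab for rot in permuterm_rotations(term))

def wildcard_lookup(entries, query):
    """Answer a single-* query like co*tion: rotate it to tion$co*,
    then prefix-search the sorted rotations."""
    prefix_part, suffix_part = query.split("*")
    prefix = suffix_part + "$" + prefix_part   # the * now falls at the end
    keys = [rot for rot, _ in entries]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")  # upper bound of the prefix range
    return sorted({term for _, term in entries[lo:hi]})

vocab = ["cognition", "coronation", "cotton", "caution"]
entries = build_permuterm(vocab)
print(wildcard_lookup(entries, "co*tion"))  # ['cognition', 'coronation']
```

In a real system the sorted rotations would live in the B-tree itself, so the range scan is a single prefix traversal instead of two lookups plus an intersection.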
Spelling correction: isolated word correction
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits
Sec 3.3.2
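The Levenshtein computation can be sketched with the standard two-row dynamic program (the judgment/judgement pair is the minor-variant example from the hash-table slide):

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions cost 1 each."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # row for the empty prefix of a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            substitution = prev[j - 1] + (a[i - 1] != b[j - 1])
            curr[j] = min(prev[j] + 1,       # delete a[i-1]
                          curr[j - 1] + 1,   # insert b[j-1]
                          substitution)
        prev = curr
    return prev[n]

print(edit_distance("judgment", "judgement"))  # 1
```

Keeping only two rows brings the memory cost down from O(mn) to O(n) without changing the O(mn) running time.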
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing (BSBI); merge sort is effective for disk-based sorting (avoids seeks)
Single-Pass In-Memory Indexing (SPIMI): no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate them as they occur)
Distributed indexing using MapReduce
Dynamic indexing: multiple indices, logarithmic merge
BSBI: Blocked Sort-Based Indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID); as terms are of variable length, create a dictionary to map terms to 4-byte termIDs
These are generated as we parse docs; must now sort 100M 8-byte records by termID
Define a block as ~10M such records: can easily fit a couple into memory; will have 10 such blocks for our collection
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order
Sec 4.2
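A toy in-memory sketch of BSBI's shape, assuming the (termID, docID) pairs arrive as a stream: sort fixed-size blocks, then k-way-merge the sorted runs (the real algorithm writes each run to disk and merges with sequential reads):

```python
import heapq
import itertools

def bsbi(pairs, block_size):
    """pairs: iterable of (termID, docID) records.
    Sorts block_size-record blocks independently, then merges the runs."""
    runs = []
    it = iter(pairs)
    while True:
        block = list(itertools.islice(it, block_size))
        if not block:
            break
        runs.append(sorted(block))       # one sorted run per block
    merged = heapq.merge(*runs)          # k-way merge of the sorted runs
    # Collapse the merged (termID, docID) stream into postings lists
    index = {}
    for term_id, doc_id in merged:
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:   # drop duplicates
            postings.append(doc_id)
    return index

print(bsbi([(2, 1), (1, 1), (2, 2), (1, 3), (1, 2), (2, 1)], 3))
```

`heapq.merge` plays the role of the multi-way disk merge: it only ever looks at the head of each run, which is what keeps the disk version seek-free.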
SPIMI: Single-Pass In-Memory Indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-to-termID mapping across blocks
Key idea 2: don't sort; accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 4.3
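A sketch of SPIMI's two key ideas (a fresh dictionary per block, postings accumulated without sorting), assuming the token stream arrives document by document:

```python
def spimi_invert(token_stream, max_postings):
    """token_stream yields (term, docID). Returns a list of per-block indexes;
    terms are sorted only when a block is written out, postings never are."""
    blocks = []
    dictionary = {}
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])   # per-block dictionary
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)                  # accumulate as they occur
            count += 1
        if count >= max_postings:                    # block "memory" is full
            blocks.append(dict(sorted(dictionary.items())))
            dictionary, count = {}, 0
    if dictionary:
        blocks.append(dict(sorted(dictionary.items())))
    return blocks

def merge_blocks(blocks):
    """Merge the per-block indexes into one big index."""
    index = {}
    for block in blocks:
        for term, postings in block.items():
            index[term] = sorted(set(index.get(term, []) + postings))
    return index
```

Because documents are processed in docID order, each block's postings lists come out sorted for free; only the term keys need a sort, and only once per block.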
Distributed indexing: MapReduce data flow
[Figure: in the map phase, parsers read input splits (segment files) and emit term partitions a–b, c–d, …, y–z; in the reduce phase, one inverter per partition collects its pairs and writes the postings. A master node assigns tasks to parsers and inverters.]
Dynamic indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index; search across both and merge results
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output by a search with it
Periodically re-index into one main index
Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times
Alternatively, we can apply logarithmic merge, in which each posting is merged up to log T times
Sec 4.5
Logarithmic merge: maintain a series of indexes, each twice as large as the previous one
Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) into Z1
Either write-merge Z1 to disk as I1 (if there is no I1), or merge with I1 to form Z2
… and so on, looping for log levels
Sec 4.5
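The cascade can be sketched with sorted docID lists standing in for indexes (Z0 is `memory`; I0, I1, … are slots in `disk`). This illustrates the control flow only, not the course's reference implementation:

```python
def log_merge_add(memory, disk, doc, n):
    """Add one doc to in-memory Z0; when |Z0| exceeds n, cascade-merge the
    carry into disk levels I0, I1, ... (level k holds up to n * 2**k docs)."""
    memory.append(doc)
    if len(memory) <= n:
        return memory, disk
    carry = sorted(memory)                # flush Z0
    memory = []
    level = 0
    while True:                           # loop for log levels
        if level >= len(disk):
            disk.append(None)             # grow the level list on demand
        if disk[level] is None:
            disk[level] = carry           # write carry as I_level
            break
        carry = sorted(disk[level] + carry)   # merge I_level in, carry upward
        disk[level] = None
        level += 1
    return memory, disk

memory, disk = [], []
for d in [1, 2, 3, 4]:
    memory, disk = log_merge_add(memory, disk, d, n=1)
print(disk)  # [None, [1, 2, 3, 4]]
```

Each posting rides the carry at most once per level, which is where the logarithmic bound on merge work comes from.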
Week 6: Index compression
Collection and vocabulary statistics: Heaps' and Zipf's laws
Compression to make the index smaller and faster
Dictionary compression for Boolean indexes: dictionary-as-a-string, blocks, front coding
Postings compression: gap encoding
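Gap encoding for postings can be sketched as a simple round-trip: store the first docID, then only the differences between successive docIDs, which are small and compress well:

```python
def to_gaps(postings):
    """First docID, then gaps between successive docIDs."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Recover docIDs by a running sum over the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

print(to_gaps([283042, 283043, 283044, 283045]))  # [283042, 1, 1, 1]
```

On its own this saves nothing; the payoff comes from pairing it with a variable-length code so the many small gaps take few bytes each.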
Index compression: dictionary-as-a-string and blocking
Store pointers to every kth term string; example below with k = 4
Need to store term lengths (1 extra byte each):
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths
Sec 5.2
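A sketch of dictionary-as-a-string with blocking, using a 1-character length prefix in place of the slide's 1-byte term lengths and keeping a pointer only to every kth term:

```python
def pack_blocked(terms, k=4):
    """Concatenate sorted terms, each length-prefixed; record an offset
    (pointer) only for every kth term. Assumes terms shorter than 256 chars."""
    pieces, pointers = [], []
    offset = 0
    for i, term in enumerate(sorted(terms)):
        if i % k == 0:
            pointers.append(offset)            # block-leading pointer
        pieces.append(chr(len(term)) + term)   # 1-char length prefix
        offset += 1 + len(term)
    return "".join(pieces), pointers

def unpack_block(s, pointer, k=4):
    """Decode up to k terms starting at a block pointer."""
    terms, pos = [], pointer
    while pos < len(s) and len(terms) < k:
        length = ord(s[pos])
        terms.append(s[pos + 1:pos + 1 + length])
        pos += 1 + length
    return terms

s, ptrs = pack_blocked(["syzygy", "systile", "syzygial", "syzygetic"])
print(ptrs, unpack_block(s, ptrs[0]))
```

Lookup becomes: binary-search the block pointers, then scan linearly through at most k length-prefixed terms, which is the time-for-space trade the slide's byte counts quantify.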
CS3245 ndash Information Retrieval
Indexer steps Dictionary amp Postings Multiple term
entries in a single document are merged
Split into Dictionary and Postings
Doc frequencyinformation is also stored
Why frequency
Sec 12
44
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
The merge Walk through the two postings simultaneously in
time linear in the total number of postings entries
34
1282 4 8 16 32 64
1 2 3 5 8 13 21
128
34
2 4 8 16 32 64
1 2 3 5 8 13 21
BrutusCaesar2 8
If the list lengths are x and y the merge takes O(x+y)operationsCrucial postings must be sorted by docID
Sec 13
45
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Week 3 Postings lists and Choosing terms Faster merging of posting lists Skip pointers
Handling of phrase and proximity queries Biword indexes for phrase queries Positional indexes for phraseproximity queries
Steps in choosing terms for the dictionary Text extraction Granularity of indexing Tokenization Stop word removal Normalization Lemmatization and stemming
46
Ch 2
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Adding skip pointers to postings
Done at indexing time Why To skip postings that will not figure in the
search results How to do it And where do we place skip pointers
47
1282 4 8 41 48 64
311 2 3 8 11 17 213111
41 128
Sec 23
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Biword indexes Index every consecutive pair of terms in the text as a
phrase bigram model using words For example the text ldquoFriends Romans
Countrymenrdquo would generate the biwords friends romans romans countrymen
Each of these biwords is now a dictionary term Two-word phrase query-processing is now
immediate
48
Sec 241
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
Hash table
Each vocabulary term is hashed to an integer.
Pros: lookup is faster than for a tree: O(1).
Cons (not very tolerant): no easy way to find minor variants (judgment/judgement); no prefix search; if the vocabulary keeps growing, we occasionally need the expensive operation of rehashing everything.
52
Sec. 3.1
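The prefix-search limitation can be made concrete: a hash table answers only exact lookups, while a sorted term array supports prefix search via binary search (a small sketch with a made-up vocabulary):

```python
import bisect

vocab = sorted(["judge", "judgement", "judgment", "syzygy"])
table = {t: i for i, t in enumerate(vocab)}   # hash lookup: O(1), exact match only

def prefix_search(prefix):
    """Binary-search to the first term >= prefix, then scan while the prefix matches."""
    i = bisect.bisect_left(vocab, prefix)
    out = []
    while i < len(vocab) and vocab[i].startswith(prefix):
        out.append(vocab[i])
        i += 1
    return out

assert "judgment" in table                                       # exact lookup works
assert prefix_search("judg") == ["judge", "judgement", "judgment"]
```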
Wildcard queries
How can we handle *'s in the middle of a query term? E.g. co*tion.
We could look up co* AND *tion in a B-tree and intersect the two term sets. Expensive!
The solution: transform wildcard queries so that the * always occurs at the end. This gives rise to the Permuterm index. Alternative: k-gram index with postfiltering.
53
Sec. 3.2
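The Permuterm rotation can be sketched in a few lines ("cotation" below is a made-up matching term, used only to show that the rewritten query becomes a prefix lookup):

```python
def permuterm_keys(term):
    """All rotations of term + '$'; a query X*Y is answered by a prefix lookup on Y$X."""
    t = term + "$"
    return {t[i:] + t[:i] for i in range(len(t))}

# The wildcard query co*tion is rewritten (rotated) to the prefix query tion$co*
keys = permuterm_keys("cotation")
assert any(k.startswith("tion$co") for k in keys)
assert not any(k.startswith("tion$co") for k in permuterm_keys("lotion"))
```

Every rotation is stored in the dictionary, so any single `*` can always be rotated to the end of the key.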
Spelling correction
Isolated word correction: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. How do we define "closest"? We studied several alternatives:
1. Edit distance (Levenshtein distance)
2. Weighted edit distance
3. n-gram overlap
Context-sensitive spelling correction: try all possible resulting phrases with one word "corrected" at a time, and suggest the one with the most hits.
54
Sec. 3.3.2
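Alternative 1 (Levenshtein distance) is the standard dynamic-programming recurrence; a compact sketch keeping only one row of the table:

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard DP table, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (0 if match)
        prev = cur
    return prev[-1]

assert edit_distance("judgment", "judgement") == 1
assert edit_distance("kitten", "sitting") == 3
```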
Week 5: Index construction
Sort-based indexing: Blocked Sort-Based Indexing (BSBI); merge sort is effective for disk-based sorting (avoids seeks).
Single-Pass In-Memory Indexing (SPIMI): no global dictionary (generate a separate dictionary for each block); don't sort postings (accumulate postings as they occur).
Distributed indexing using MapReduce.
Dynamic indexing: multiple indices, logarithmic merge.
55
BSBI: Blocked sort-based indexing (sorting with fewer disk seeks)
8-byte (4+4) records (termID, docID). As terms are of variable length, create a dictionary to map terms to 4-byte termIDs; these are generated as we parse docs.
Must now sort 100M 8-byte records by termID. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks for our collection.
Basic idea of the algorithm: accumulate postings for each block, sort, and write to disk; then merge the blocks into one long sorted order.
Sec. 4.2
56
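The two phases above, sketched in memory (lists stand in for disk runs; block sizes are tiny for illustration):

```python
import heapq, itertools

def bsbi(pairs, block_size):
    """Sort (termID, docID) pairs in fixed-size blocks, then k-way merge the runs."""
    it = iter(pairs)
    runs = []
    while True:
        block = list(itertools.islice(it, block_size))   # one block "in memory"
        if not block:
            break
        runs.append(sorted(block))                       # sort block, "write run to disk"
    return list(heapq.merge(*runs))                      # single merge pass over all runs

pairs = [(3, 1), (1, 2), (2, 1), (1, 1), (3, 2), (2, 2)]
assert bsbi(pairs, block_size=2) == sorted(pairs)
```

`heapq.merge` plays the role of the final multi-way merge: it only ever looks at the head of each sorted run, which is what makes the disk version seek-friendly.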
SPIMI: Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block – no need to maintain a term-to-termID mapping across blocks.
Key idea 2: don't sort; accumulate postings in postings lists as they occur.
With these two ideas we can generate a complete inverted index for each block. These separate indices can then be merged into one big index.
Sec. 4.3
57
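Both key ideas fit in a few lines (a sketch with toy data; real SPIMI writes each block's index to disk before merging):

```python
def spimi_block(token_stream):
    """One SPIMI block: fresh dictionary, docIDs appended unsorted as they occur."""
    index = {}
    for term, doc_id in token_stream:
        index.setdefault(term, []).append(doc_id)   # no global termIDs, no sorting
    return index

def merge_blocks(blocks):
    """Merge per-block indices into one, sorting postings once at the end."""
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            merged.setdefault(term, []).extend(postings)
    return {t: sorted(set(p)) for t, p in merged.items()}

b1 = spimi_block([("brutus", 1), ("caesar", 1)])
b2 = spimi_block([("caesar", 2)])
assert merge_blocks([b1, b2]) == {"brutus": [1], "caesar": [1, 2]}
```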
Distributed indexing: MapReduce data flow
58
[Diagram: a master assigns splits of the collection to parsers (map phase), which write (term, docID) pairs into segment files partitioned by term range (a-b, c-d, …, y-z); inverters (reduce phase) each collect one term-range partition from all segment files and write out the postings.]
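The data flow can be mimicked locally (a sketch, not a real MapReduce job; the two-way a-m/n-z partition is a stand-in for the a-b, c-d, …, y-z term ranges):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Parser: emit (term, docID) pairs for one document split."""
    return [(term, doc_id) for term in text.lower().split()]

def partition(pairs):
    """Write pairs into term-range segment files (here just two ranges)."""
    segs = {"a-m": [], "n-z": []}
    for term, d in pairs:
        segs["a-m" if term[0] <= "m" else "n-z"].append((term, d))
    return segs

def reduce_phase(pairs):
    """Inverter: collect all pairs for its term range into sorted postings lists."""
    postings = defaultdict(list)
    for term, d in sorted(pairs):
        postings[term].append(d)
    return dict(postings)

pairs = map_phase(1, "Caesar was ambitious") + map_phase(2, "Brutus killed Caesar")
segs = partition(pairs)
assert reduce_phase(segs["a-m"])["caesar"] == [1, 2]
```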
Dynamic indexing: 2nd simplest approach
Maintain a "big" main index; new docs go into a "small" (in-memory) auxiliary index. Search across both and merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this bit-vector.
Periodically re-index into one main index. Assuming T total postings and n as the size of the auxiliary index, we touch each posting up to floor(T/n) times. Alternatively, we can apply logarithmic merge, in which each posting is merged up to O(log(T/n)) times.
Sec. 4.5
59
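A compact sketch of the main/auxiliary split with an invalidation set standing in for the bit-vector (all names and limits below are illustrative):

```python
class DynamicIndex:
    """Main + in-memory auxiliary index, with an invalidation set for deletions."""
    def __init__(self, aux_limit):
        self.main, self.aux = {}, {}
        self.deleted = set()                 # stand-in for the invalidation bit-vector
        self.aux_limit = aux_limit

    def add(self, term, doc_id):
        self.aux.setdefault(term, []).append(doc_id)
        if sum(len(p) for p in self.aux.values()) > self.aux_limit:
            for t, p in self.aux.items():    # periodic re-index: fold aux into main
                self.main.setdefault(t, []).extend(p)
            self.aux = {}

    def search(self, term):
        """Search both indexes, then filter out invalidated docs."""
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if d not in self.deleted]

idx = DynamicIndex(aux_limit=2)
idx.add("caesar", 1); idx.add("caesar", 2); idx.add("brutus", 3)
idx.deleted.add(2)
assert idx.search("caesar") == [1]
```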
Logarithmic merge
Idea: maintain a series of indexes, each twice as large as the previous one. Keep the smallest (Z0) in memory; larger ones (I0, I1, …) on disk.
If Z0 gets too big (> n), write it to disk as I0, or merge it with I0 (if I0 already exists) as Z1.
Either write merged Z1 to disk as I1 (if no I1), or merge with I1 to form Z2 … and so on, looping for log levels.
Sec. 4.5
60
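The carry-and-merge loop above works like binary addition over index levels; a sketch with sorted lists standing in for on-disk indexes:

```python
def log_merge(indexes, new_index, n):
    """indexes[i] holds an index of size n * 2**i, or None; carry-merge upward."""
    i = 0
    z = new_index
    while i < len(indexes) and indexes[i] is not None:
        z = sorted(z + indexes[i])   # merge Z_i with I_i to form Z_{i+1}
        indexes[i] = None
        i += 1
    if i == len(indexes):
        indexes.append(None)
    indexes[i] = z                   # write Z_i "to disk" as I_i
    return indexes

levels = []
for doc in ([1], [2], [3], [4]):
    levels = log_merge(levels, doc, n=1)
assert levels == [None, None, [1, 2, 3, 4]]
```

After four unit-size inserts, everything has been carried up into a single level-2 index, exactly like 1+1+1+1 = 100 in binary.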
Week 6: Index Compression
Collection and vocabulary statistics: Heaps' and Zipf's laws.
Compression to make the index smaller and faster: dictionary compression for Boolean indexes (dictionary string, blocks, front coding); postings compression (gap encoding).
61
Index compression: Dictionary-as-a-String and blocking
Store pointers to every kth term string (example below: k = 4). Need to store term lengths (1 extra byte each).
…7systile9syzygetic8syzygial6syzygy11szaibelyite…
Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.
62
Sec. 5.2
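The savings arithmetic on the slide, written out (assuming 3-byte term pointers, as in the textbook's example):

```python
# Blocking arithmetic for the k = 4 example above.
k = 4
pointer_bytes = 3                    # assumed size of one term pointer
saved = (k - 1) * pointer_bytes      # drop k-1 pointers per block: 9 bytes
lost = k * 1                         # one extra length byte per term: 4 bytes
assert (saved, lost, saved - lost) == (9, 4, 5)   # net 5 bytes saved per 4-term block
```

So each 4-term block nets 5 bytes of savings, at the cost of scanning up to k terms within a block during lookup.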
CS3245 ndash Information Retrieval
Positional index example
For phrase queries we use a merge algorithm recursively at the document level
Now need to deal with more than just equality
49
ltbe 9934271 7 18 33 72 86 2312 3 1494 17 191 291 430 4345 363 367 hellipgt
Quick checkWhich of docs 1245could contain ldquoto be
or not to berdquo
Sec 242
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Choosing terms Definitely language specific
In English we worry about Tokenization Stop word removal Normalization (eg case folding) Lemmatization and stemming
50
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Data Structures for the Dictionary Hash Trees
Learning to be tolerant1 Wildcards General Trees Permuterm Ngrams redux
2 Spelling Correction Edit Distance Ngrams re-redux
3 Phonetic ndash Soundex
51
Week 4 The dictionary and tolerant retrieval
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Hash TableEach vocabulary term is hashed to an integer Pros Lookup is faster than for a tree O(1)
Cons No easy way to find minor variants
judgmentjudgement
No prefix search If vocabulary keeps growing need to occasionally do the
expensive operation of rehashing everything
52
Sec 31
Not very tolerant
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Wildcard queries How can we handle rsquos in the middle of query term cotion
We could look up co AND tion in a B-tree and intersect the two term sets Expensive
The solution transform wild-card queries so that the always occurs at the end
This gives rise to the Permuterm index Alternative K-gram index with postfiltering
53
Sec 32
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Spelling correction Isolated word correction Given a lexicon and a character sequence Q return the
words in the lexicon closest to Q How do we define ldquoclosestrdquo We studied several alternatives
1 Edit distance (Levenshtein distance)2 Weighted edit distance3 ngram overlap
Context-sensitive spelling correction Try all possible resulting phrases with one word
ldquocorrectedrdquo at a time and suggest the one with most hits54
Sec 332
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks) 8-byte (4+4) records (termID docID) As terms are of variable length create a dictionary to map
terms to termIDs of 4 bytes
These are generated as we parse docs Must now sort 100M 8-byte records by termID Define a Block as ~ 10M such records Can easily fit a couple into memory Will have 10 such blocks for our collection
Basic idea of algorithm Accumulate postings for each block sort write to disk Then merge the blocks into one long sorted order
Sec 42
56Information Retrieval
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1 Generate separate dictionaries for each block ndash no need to maintain term-termID mapping across blocks
Key idea 2 Donrsquot sort Accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
hellip
MasterParsers
a-b c-d y-zhellip
a-b c-d y-zhellip
a-b c-d y-zhellip
Inverters
Map phase Reduce phaseSegment files
a-b
c-d
y-z
Postings
CS3245 ndash Information Retrieval
Dynamic Indexing 2nd simplest approach Maintain ldquobigrdquo main index New docs go into ldquosmallrdquo (in memory) auxiliary index Search across both merge results Deletions Invalidation bit-vector for deleted docs Filter docs output on a search result by this invalidation
bit-vector Periodically re-index into one main index Assuming T total of postings and n as size of auxiliary
index we touch each posting up to floor(Tn) times Alternatively we can apply Logarithm merge in which
each posting is merged up to log T times
Sec 45
59
CS3245 ndash Information Retrieval
Logarithmic merge Idea maintain a series of indexes each twice as large
as the previous one Keep smallest (Z0) in memory Larger ones (I0 I1 hellip) on disk If Z0 gets too big (gt n) write to disk as I0
or merge with I0 (if I0 already exists) as Z1
Either write merge Z1 to disk as I1 (if no I1)Or merge with I1 to form Z2
hellip etc
Sec 45
60Information Retrieval
Loop
for
log lev
els
CS3245 ndash Information Retrieval
Week 6 Index Compression Collection and vocabulary statistics Heapsrsquo
and Zipfrsquos laws
Compression to make index smaller faster Dictionary compression for Boolean indexes Dictionary string blocks front coding
Postings compression Gap encoding
61
CS3245 ndash Information Retrieval
Index CompressionDictionary-as-a-String and Blocking Store pointers to every kth term string Example below k=4
Need to store term lengths (1 extra byte)
62
hellip7systile9syzygetic8syzygial6syzygy11szaibelyite hellip
Save 9 bytes on 3 pointers
Lose 4 bytes onterm lengths
Sec 52
CS3245 ndash Information Retrieval
Week 5 Index construction Sort-based indexing Blocked Sort-Based Indexing
Merge sort is effective for disk-based sorting (avoid seeks)
Single-Pass In-Memory Indexing No global dictionary - Generate separate dictionary for each block Donrsquot sort postings - Accumulate postings as they occur
Distributed indexing using MapReduce Dynamic indexing Multiple indices logarithmic merge
55
CS3245 ndash Information Retrieval
BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
 8-byte (4+4) records (termID, docID)
 As terms are of variable length, create a dictionary to map terms to 4-byte termIDs
 These are generated as we parse docs
 Must now sort 100M 8-byte records by termID
 Define a block as ~10M such records; can easily fit a couple into memory
 Will have 10 such blocks for our collection
 Basic idea of the algorithm:
 Accumulate postings for each block, sort, write to disk
 Then merge the blocks into one long sorted order
Sec 42
Information Retrieval 56
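As a revision aid (not from the slides), the merge of sorted blocks can be sketched with the standard library's heapq.merge, which streams already-sorted runs into one long sorted order; the toy blocks here are assumptions:

```python
# Sketch of the BSBI merge step: each block is a run of (termID, docID)
# records already sorted on disk; heapq.merge lazily merges them without
# loading every block into memory at once.
import heapq

block1 = [(1, 3), (2, 1), (5, 2)]   # toy sorted blocks of (termID, docID)
block2 = [(1, 4), (3, 2), (5, 7)]

merged = list(heapq.merge(block1, block2))
# merged == [(1, 3), (1, 4), (2, 1), (3, 2), (5, 2), (5, 7)]
```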
CS3245 ndash Information Retrieval
SPIMI Single-pass in-memory indexing
Key idea 1: generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks
 Key idea 2: don't sort; accumulate postings in postings lists as they occur
With these two ideas we can generate a complete inverted index for each block
These separate indices can then be merged into one big index
Sec 43
57
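The two key ideas can be sketched as follows (a toy token stream, with an in-memory merge standing in for the on-disk one):

```python
# Sketch of SPIMI-Invert for one block: a per-block dictionary, postings
# accumulated as they occur (no sorting of postings), terms sorted only
# when the block is written out.
def spimi_invert(token_stream):
    """token_stream yields (term, doc_id) pairs for one block."""
    dictionary = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)  # docIDs arrive in order, so no sort needed
    # sort terms once, at block-write time
    return dict(sorted(dictionary.items()))

def merge_blocks(blocks):
    """Merge per-block indexes into one big index (simplified, in memory)."""
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            merged.setdefault(term, []).extend(postings)
    return {t: sorted(set(p)) for t, p in sorted(merged.items())}
```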
CS3245 ndash Information Retrieval
Distributed Indexing MapReduce Data flow
58
[Diagram: the Master assigns splits of the collection to Parsers (map phase), which write segment files partitioned by term range (a-b, c-d, …, y-z); each Inverter (reduce phase) collects the segment files for one term range and writes the sorted postings]
CS3245 ndash Information Retrieval
Logarithmic merge
 Idea: maintain a series of indexes, each twice as large as the previous one
 Keep the smallest (Z0) in memory
 Larger ones (I0, I1, …) on disk
 If Z0 gets too big (> n), write to disk as I0, or merge with I0 (if I0 already exists) as Z1
 Either write merge Z1 to disk as I1 (if no I1), or merge with I1 to form Z2
 … etc.
 (Loop for log levels)
Sec 45
Information Retrieval 60
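The cascade above can be simulated in memory as a sketch (not from the slides); the tiny threshold, the dict-based "indexes", and the class name are toy assumptions:

```python
# Sketch of logarithmic merging: Z0 is the in-memory index; levels[i]
# plays the role of I_i on disk, holding roughly n * 2**i postings.
# When Z0 exceeds n postings, it cascades upward, merging with each
# occupied level until it reaches a free one.
N_LIMIT = 3  # toy threshold n for the in-memory index

def merge(a, b):
    """Merge two indexes (term -> sorted posting list)."""
    out = {t: list(p) for t, p in a.items()}
    for term, postings in b.items():
        out[term] = sorted(set(out.get(term, [])) | set(postings))
    return out

class LogMergeIndex:
    def __init__(self):
        self.z0 = {}          # in-memory index Z0
        self.levels = []      # levels[i] is I_i, or None if that level is free
        self.size = 0

    def add(self, term, doc_id):
        self.z0.setdefault(term, []).append(doc_id)
        self.size += 1
        if self.size > N_LIMIT:
            self._flush()

    def _flush(self):
        z, i = self.z0, 0
        # cascade: merge with each occupied level, doubling in size each time
        while i < len(self.levels) and self.levels[i] is not None:
            z = merge(z, self.levels[i])
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = z
        self.z0, self.size = {}, 0

    def search(self, term):
        # a query must consult Z0 and every level
        result = set(self.z0.get(term, []))
        for lev in self.levels:
            if lev:
                result |= set(lev.get(term, []))
        return sorted(result)
```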
CS3245 ndash Information Retrieval
Week 6: Index Compression
 Collection and vocabulary statistics: Heaps' and Zipf's laws
 Compression to make the index smaller/faster
 Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
 Postings compression: gap encoding
61
CS3245 ndash Information Retrieval
Index Compression: Dictionary-as-a-String and Blocking
 Store pointers to every kth term string (example below: k = 4)
 Need to store term lengths (1 extra byte)
62
 …7systile9syzygetic8syzygial6syzygy11szaibelyite…
 Save 9 bytes on 3 pointers; lose 4 bytes on term lengths
Sec 52
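A sketch of this layout using the slide's terms (length bytes are stored as chr() codes here, a simplification):

```python
# Dictionary-as-a-string with blocking (k = 4): each term is stored as
# <1 length byte><chars>, and a pointer is kept only to the first term
# of each block, saving pointer space at the cost of the length bytes.
def build_blocked_dict(terms, k=4):
    s, pointers, pos = [], [], 0
    for i, t in enumerate(terms):
        if i % k == 0:
            pointers.append(pos)   # pointer to every kth term
        s.append(chr(len(t)) + t)  # 1 length byte + term chars
        pos += 1 + len(t)
    return "".join(s), pointers

def read_block(s, ptr, k=4):
    """Decode up to k terms starting at a block pointer."""
    terms, pos = [], ptr
    while pos < len(s) and len(terms) < k:
        n = ord(s[pos])
        terms.append(s[pos + 1 : pos + 1 + n])
        pos += 1 + n
    return terms

string, ptrs = build_blocked_dict(
    ["systile", "syzygetic", "syzygial", "syzygy", "szaibelyite"])
```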
CS3245 ndash Information Retrieval
Front coding
 Sorted words commonly have a long common prefix; store differences only
 Used for the last k−1 terms in a block of k:
 8automata8automate9automatic10automation
 → 8automat*a1◊e2◊ic3◊ion
 (encodes "automat" once; each later entry stores only its extra length beyond "automat")
Information Retrieval 63
Begins to resemble general string compression
Sec 52
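A sketch that reproduces the slide's encoding ("*" ends the shared prefix, ◊ marks each continuation); the function name is an assumption:

```python
# Front coding for one block of k sorted terms: the first term is stored
# in full (with "*" marking the end of the shared prefix); every later
# term stores only the length and characters of its suffix beyond that
# prefix, each introduced by the marker "◊".
import os

def front_code(terms):
    prefix = os.path.commonprefix(terms)   # character-wise common prefix
    first = terms[0]
    out = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for t in terms[1:]:
        suffix = t[len(prefix):]
        out.append(f"{len(suffix)}\u25ca{suffix}")  # ◊ = continuation
    return "".join(out)
```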
CS3245 ndash Information Retrieval
Postings Compression: Postings file entry
 We store the list of docs containing a term in increasing order of docID:
 computer: 33, 47, 154, 159, 202, …
 Consequence: it suffices to store gaps: 33, 14, 107, 5, 43, …
 Hope: most gaps can be encoded/stored with far fewer than 20 bits
64
Sec 53
CS3245 ndash Information Retrieval
Variable Byte Encoding Example
 docIDs: 824, 829, 215406
 gaps: 824, 5, 214577 (the first entry is the docID itself)
 VB code: 00000110 10111000 | 10000101 | 00001101 00001100 10110001
65
Postings stored as the byte concatenation: 00000110 10111000 10000101 00001101 00001100 10110001
Key property VB-encoded postings areuniquely prefix-decodable
For a small gap (5) VB uses a whole byte
Sec 53
512+256+32+16+8 = 824
CS3245 ndash Information Retrieval
Week 7 Vector space ranking
1. Represent the query as a weighted tf-idf vector
2. Represent each document as a weighted tf-idf vector
3. Compute the cosine similarity score for the query vector and each document vector
4. Rank documents with respect to the query by score
5. Return the top K (e.g. K = 10) to the user
66
CS3245 ndash Information Retrieval
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:
 w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))
 Best known weighting scheme in IR
 Note: the "-" in tf-idf is a hyphen, not a minus sign
 Alternative names: tf.idf, tf × idf
 Increases with the number of occurrences within a document
 Increases with the rarity of the term in the collection
67
Sec 622
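The formula transcribed directly (log base 10, zero weight for an absent term):

```python
# tf-idf weight: w(t,d) = (1 + log10 tf(t,d)) * log10(N / df(t))
import math

def tf_idf(tf, df, N):
    if tf == 0 or df == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)
```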
CS3245 ndash Information Retrieval
Queries as vectors Key idea 1 Do the same for queries represent them
as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation Want to rank more relevant documents higher than less relevant documents
68
Sec 63
CS3245 ndash Information Retrieval
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product): cos(q, d) = q · d = Σᵢ qᵢ dᵢ, for length-normalized q and d
Cosine for length-normalized vectors
Information Retrieval 69
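A minimal check of this identity:

```python
# After length normalization, cosine similarity is just the dot product.
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(q, d):
    return sum(a * b for a, b in zip(normalize(q), normalize(d)))
```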
CS3245 ndash Information Retrieval
Information Retrieval 70
Sec 64
tf-idf weighting has many variants
CS3245 ndash Information Retrieval
Week 8: Complete Search Systems
 Making the Vector Space Model more efficient to compute
 Approximating the actual correct results
 Skipping unnecessary documents
 In actual data: dealing with zones and fields, query term proximity
Resources for today IIR 7 61
71
CS3245 ndash Information Retrieval
Recap Computing cosine scores
72
Sec 633
CS3245 ndash Information Retrieval
Generic approach
 Find a set A of contenders, with K < |A| ≪ N
 A does not necessarily contain the top K, but has many docs from among the top K
 Return the top K docs in A
 Think of A as pruning non-contenders
 The same approach can also be used for other (non-cosine) scoring functions
73
Sec 711
CS3245 ndash Information Retrieval
Heuristics
 Index elimination: high-idf query terms only; docs containing many query terms
 Champion lists: generalized to high-low lists and tiered indexes
 Impact-ordered postings: early termination, idf-ordered terms
 Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination than an equal weighting
Indeed any function of the two ldquosignalsrdquo of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields
 Year = 1601 is an example of a field
 Field or parametric index: postings for each field value
 Sometimes build range trees (B-trees, e.g. for dates)
 Field query typically treated as conjunction (doc must be authored by Shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9: IR Evaluation
 How do we know if our results are any good?
 Benchmarks; Precision and Recall; composite measures; test collections
 A/B Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
[Figure: a precision-recall curve, with precision on the y-axis and recall on the x-axis, both running from 0.0 to 1.0]
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure
 Agreement measure among judges
 Designed for categorical judgments
 Corrects for chance agreement
 Kappa: K = [P(A) − P(E)] / [1 − P(E)]
 P(A): proportion of the time the judges agree
 P(E): what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
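A sketch of the computation for two judges making binary relevant/non-relevant calls; P(E) is estimated from the pooled marginal proportions, and the judgment vectors are toy data:

```python
# Kappa: K = (P(A) - P(E)) / (1 - P(E)) for two judges giving 0/1 calls.
def kappa(judge_a, judge_b):
    n = len(judge_a)
    # P(A): observed proportion of agreement
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # P(E): chance agreement from the pooled marginal "relevant" rate
    p_rel = (sum(judge_a) + sum(judge_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)
```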
CS3245 ndash Information Retrieval
A/B testing
 Purpose: test a single innovation
 Prerequisite: you have a large system with many users
 Have most users use the old system
 Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation
 Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result
 Now we can directly see if the innovation does improve user happiness
 Probably the evaluation methodology that large search engines trust most
80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 10.3 XML IR: basic XML concepts, challenges in XML IR, vector space model for XML IR, evaluation of XML IR
Chapter 9 Query Refinement:
1. Relevance Feedback (document level)
 Explicit RF: Rocchio (1971); when does it work?
 Variants: implicit and blind
2. Query Expansion (term level)
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
 Different from Cr and Cnr, as we only get judgments from a few documents
α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
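A sketch of the update; the slide's equation was an image, so the formula used here (q' = α·q + β·centroid(Dr) − γ·centroid(Dnr), with negative components clipped to 0, as is common in practice) is taken from the textbook treatment, and the default weights are illustrative:

```python
# Rocchio relevance feedback: move the query vector toward the centroid
# of the known relevant docs and away from the known irrelevant ones.
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    cr = centroid(rel) if rel else [0.0] * len(q)
    cnr = centroid(nonrel) if nonrel else [0.0] * len(q)
    return [max(0.0, alpha * qi + beta * r - gamma * nr)
            for qi, r, nr in zip(q, cr, cnr)]
```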
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents
 In query expansion, users give additional input (good/bad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix
 wij = (normalized) weight for (ti, dj)
 For each ti, pick terms with high values in C
Sec 923
In NLTK Did you forget
84
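The C = AAᵀ construction on a toy term-document incidence matrix (the matrix and the helper name are assumptions):

```python
# Co-occurrence thesaurus sketch: C = A @ A.T over a tiny term-document
# matrix; the highest off-diagonal entries in row i of C point at the
# terms most similar to term i.
import numpy as np

A = np.array([[1, 1, 0],    # term 0
              [1, 1, 1],    # term 1: co-occurs with terms 0 and 2
              [0, 0, 1]])   # term 2
C = A @ A.T

def related(term_index, k=1):
    row = C[term_index].astype(float).copy()
    row[term_index] = -np.inf            # ignore self-similarity
    return [int(i) for i in np.argsort(row)[::-1][:k]]
```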
CS3245 ndash Information Retrieval
Main idea: lexicalized subtrees
 Aim to have each dimension of the vector space encode a word together with its position within the XML tree
 How? Map XML documents to lexicalized subtrees
85
[Figure: the XML document Book(Title: "Microsoft", Author: "Bill Gates") is mapped "with words" to lexicalized subtrees such as Microsoft, Bill, Gates, Title-Microsoft, Author-Bill, Author-Gates, and Book-Title-Microsoft]
CS3245 ndash Information Retrieval
Context resemblance
 A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
 CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
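The CR function as a sketch, with "matches" implemented as a subsequence test; the example paths mirror the SimNoMerge slide:

```python
# Context resemblance: CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if the query
# path cq can be turned into the document path cd by inserting nodes,
# else 0.  Paths are given as lists of node labels.
def matches(cq, cd):
    """True iff cq is a subsequence of cd (nodes may be inserted)."""
    it = iter(cd)
    return all(node in it for node in cq)   # 'in' consumes the iterator

def cr(cq, cd):
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0
```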
CS3245 ndash Information Retrieval
SimNoMerge example
87
Query: <c1, t> (e.g. author/"Bill")
Inverted index: a dictionary of (context, term) pairs, each with a postings list of (doc, weight) entries:
 <c1, t> (e.g. author/"Bill"): <d1, 0.5> <d4, 0.1> <d9, 0.2>
 <c2, t> (e.g. title/"Bill"): <d2, 0.25> <d3, 0.1> <d12, 0.9>
 <c3, t> (e.g. book/author/firstname/"Bill"): <d3, 0.7> <d6, 0.8> <d9, 0.5>
Context resemblance: CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.6
OK to ignore query vs <c2, t> since CR = 0.0
If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(sim = Context Resemblance × query term weight × document term weight; all weights have been normalized)
This example is slightly different from the book's
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the quantized relevance values Q(rel, cov) of the components in A.
88
CS3245 ndash Information Retrieval
Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions:
 Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors
 E.g. document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise
 Different documents may have the same vector representation
 Independence: no association between terms (not true, but works in practice: a naïve assumption)
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV
 We can rank documents using the log odds ratios for the terms in the query, ct:
 ct = log [ (pt / (1 − pt)) / (ut / (1 − ut)) ]
 The odds ratio is the ratio of two odds:
 (i) the odds of the term appearing if relevant, pt / (1 − pt), and
 (ii) the odds of the term appearing if non-relevant, ut / (1 − ut)
 ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents; ct is positive if it is more likely to appear in relevant documents
 ct functions as a term weight, so that RSVd = Σ (t ∈ q ∩ d) ct
 Operationally, we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25: A Nonbinary Model
 If the query is long, we might also use similar weighting for query terms
 tf(t,q): term frequency in the query q
 k3: tuning parameter controlling term frequency scaling of the query
 No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
 The above tuning parameters should be set by optimization on a development test collection
 Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75
Sec 1143
92
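A hedged sketch of a BM25 score including the query-side (k3) factor the slide mentions; the parameter values follow the slide's ranges, while the idf form (log10 N/df) is one common variant and an assumption, not necessarily the module's exact choice:

```python
# BM25 sketch: idf * document-side tf saturation (k1, b length
# normalization) * query-side tf saturation (k3, no length normalization).
import math

def bm25(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
         k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for term, tf_q in query_tf.items():
        tf_d = doc_tf.get(term, 0)
        if tf_d == 0 or term not in df:
            continue
        idf = math.log10(N / df[term])
        doc_part = ((k1 + 1) * tf_d) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_d)
        query_part = ((k3 + 1) * tf_q) / (k3 + tf_q)
        score += idf * doc_part * query_part
    return score
```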
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great
 In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model
 Given a query q, rank documents based on P(d|q); by Bayes' rule, P(d|q) = P(q|d) P(d) / P(q)
 P(q) is the same for all documents, so ignore it
 P(d) is the prior: often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
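A sketch of the query-likelihood mixture (Jelinek-Mercer smoothing): P(q|d) = Πt [λ P(t|Md) + (1−λ) P(t|Mc)]; λ = 0.5 and the toy tokenized documents are assumptions:

```python
# Mixture model sketch: interpolate the document language model with
# the collection language model so unseen query terms don't zero out
# the whole product.
def query_likelihood(query, doc, collection, lam=0.5):
    doc_len, coll_len = len(doc), len(collection)
    p = 1.0
    for t in query:
        p_d = doc.count(t) / doc_len          # P(t | Md)
        p_c = collection.count(t) / coll_len  # P(t | Mc)
        p *= lam * p_d + (1 - lam) * p_c
    return p
```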
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling
 Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
 Duplicates: need to integrate duplicate detection
 Scale: need to use a distributed approach
 Spam and spider traps: need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robots.txt
 Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
 Example:
 User-agent: *
 Disallow: /yoursite/temp
 User-agent: searchengine
 Disallow:
 Important: cache the robots.txt file of each site we are crawling
Information Retrieval 98
Sec 202
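The example file can be checked with the standard library's urllib.robotparser (the host and crawler names are hypothetical):

```python
# Parse the slide's robots.txt and ask whether a given crawler may
# fetch a given URL.  An empty Disallow means "allow everything".
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /yoursite/temp

User-agent: searchengine
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())
```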
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xAᵏ = a
Algorithm: multiply x by increasing powers of A until the product looks stable
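The loop above as a sketch, on a hypothetical 2-page row-stochastic transition matrix A (teleporting assumed already folded in):

```python
# Power iteration: repeatedly multiply the distribution x by A until it
# settles at the steady state a, which satisfies a = a @ A.
import numpy as np

A = np.array([[0.1, 0.9],
              [0.5, 0.5]])   # toy transition matrix (rows sum to 1)

def pagerank(A, iters=100):
    x = np.zeros(A.shape[0])
    x[0] = 1.0                # start with any distribution
    for _ in range(iters):
        x = x @ A             # one step of the random walk
    return x
```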
CS3245 ndash Information Retrieval
100
Pagerank summary
 Pre-processing: given the graph of links, build matrix A; from it, compute a
 The pagerank ai is a scaled number between 0 and 1
 Query processing: retrieve pages meeting the query; rank them by their pagerank
 Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition, you have picked up skills that are useful for your future:
 Python: one of the easiest and most straightforward programming languages to use
 NLTK: a good set of routines and data that are useful in dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
CS3245 ndash Information Retrieval
Postings CompressionPostings file entry We store the list of docs containing a term in
increasing order of docID computer 3347154159202 hellip
Consequence it suffices to store gaps 3314107543 hellip
Hope most gaps can be encodedstored with far fewer than 20 bits
64
Sec 53
CS3245 ndash Information Retrieval
Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110
10111000 10000101 00001101
00001100 10110001
65
Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001
Key property VB-encoded postings areuniquely prefix-decodable
For a small gap (5) VB uses a whole byte
Sec 53
512+256+32+16+8 = 824
CS3245 ndash Information Retrieval
Week 7 Vector space ranking
1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf
vector3 Compute the cosine similarity score for the query
vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user
66
CS3245 ndash Information Retrieval
tf-idf weighting
The tf-idf weight of a term is the product of its tfweight and its idf weight
Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection67
)df(log)tflog1(w 10 tdt Ndt
times+=
Sec 622
CS3245 ndash Information Retrieval
Queries as vectors Key idea 1 Do the same for queries represent them
as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors asymp inverse of distance asymp cosine similarity
Motivation Want to rank more relevant documents higher than less relevant documents
68
Sec 63
Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

  cos(q, d) = q · d = Σᵢ qᵢ dᵢ,  for length-normalized q and d

Information Retrieval 69
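A tiny sketch of this: normalize first, then cosine really is just the dot product (helper names are illustrative).

```python
import math

def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cos_sim(q, d):
    """For length-normalized vectors, cosine similarity is the dot product."""
    return sum(qi * di for qi, di in zip(q, d))
```

Identical vectors score 1.0; orthogonal vectors score 0.0.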
tf-idf weighting has many variants
Information Retrieval 70
Sec 6.4
Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute:
  Approximating the actual correct results
  Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity
Resources for today: IIR 7, 6.1
71
Recap: Computing cosine scores
72
Sec 6.3.3
Generic approach
Find a set A of contenders, with K < |A| << N
  A does not necessarily contain the top K, but has many docs from among the top K
  Return the top K docs in A
Think of A as pruning non-contenders
The same approach can also be used for other (non-cosine) scoring functions
73
Sec 7.1.1
Heuristics
Index Elimination: High-idf query terms only; Docs containing many query terms
Champion lists: Generalized to high-low lists and tiered index
Impact ordering postings: Early termination; idf-ordered terms
Cluster pruning
74
Sec 7.1.1
Net score
Consider a simple total score combining cosine relevance and authority:

  net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting
  Indeed, any function of the two "signals" of user happiness
Now we seek the top K docs by net score
75
Sec 7.1.4
Parametric Indices
Fields
  Year = 1601 is an example of a field
  Field or parametric index: postings for each field value
  Sometimes build range trees (B-trees), e.g. for dates
  Field query typically treated as conjunction (doc must be authored by Shakespeare)
Zones
  A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, …
  Build inverted indexes on zones as well to permit querying
76
Sec 6.1
Week 9: IR Evaluation
How do we know if our results are any good?
  Benchmarks
  Precision and Recall; Composite measures
  Test collections
  A/B Testing
77
Ch 8
A precision-recall curve
[Figure: precision (y-axis) vs recall (x-axis), both on a 0.0–1.0 scale]
78
Sec 8.4
Kappa measure for inter-judge (dis)agreement
Kappa measure
  Agreement measure among judges
  Designed for categorical judgments
  Corrects for chance agreement
Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
  P(A) – proportion of the time the judges agree
  P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement
Information Retrieval 79
Sec 8.5
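The kappa formula can be computed directly from two judges' label lists; a minimal sketch (function name and two-judge restriction are assumptions of this illustration):

```python
def kappa(judgments_a, judgments_b):
    """Cohen's kappa for two judges' categorical labels.
    P(A): observed agreement; P(E): chance agreement from the marginals."""
    n = len(judgments_a)
    p_a = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n
    cats = set(judgments_a) | set(judgments_b)
    p_e = sum((judgments_a.count(c) / n) * (judgments_b.count(c) / n)
              for c in cats)
    return (p_a - p_e) / (1 - p_e)
```

Total agreement gives 1.0; agreement no better than chance gives 0.0.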
A/B testing
Purpose: Test a single innovation
Prerequisite: You have a large system with many users
Have most users use the old system
  Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on first result
Now we can directly see if the innovation does improve user happiness
Probably the evaluation methodology that large search engines trust most
80
Sec 8.6.3
Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR
  Basic XML concepts; Challenges in XML IR
  Vector space model for XML IR; Evaluation of XML IR
Chapter 9: Query Refinement
  1. Relevance Feedback (Document Level)
     Explicit RF – Rocchio (1971); When does it work?
     Variants: Implicit and Blind
  2. Query Expansion (Term Level)
     Manual thesaurus
     Automatic Thesaurus Generation
81
Rocchio (1971)
In practice:
  Dr = set of known relevant doc vectors
  Dnr = set of known irrelevant doc vectors
    Different from Cr and Cnr, as we only get judgments from a few documents
  α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
82
Sec 9.2.2
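The Rocchio update the bullets refer to (IIR eq. 9.3), with Dr, Dnr, α, β, γ as defined above:

```latex
\vec{q}_m \;=\; \alpha\,\vec{q}_0
  \;+\; \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j
  \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j
```

Negative term weights in the modified query are usually clipped to 0.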
Query Expansion
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents
In query expansion, users give additional input (good/bad search term) on words or phrases
83
Sec 9.2.2
Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = A Aᵀ, where A is the term-document matrix (M terms × N docs)
  wᵢⱼ = (normalized) weight for (tᵢ, dⱼ)
For each tᵢ, pick terms with high values in C
In NLTK! Did you forget?
84
Sec 9.2.3
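The C = A Aᵀ computation above can be sketched directly for a small dense term-document matrix (a toy illustration; real collections need sparse matrices):

```python
def cooccurrence(A):
    """C = A A^T for a term-document weight matrix A (M terms x N docs).
    C[i][j] is the similarity between terms i and j."""
    M, N = len(A), len(A[0])
    return [[sum(A[i][k] * A[j][k] for k in range(N)) for j in range(M)]
            for i in range(M)]

# Two terms, two docs: term 0 only in doc 0, term 1 in both docs.
C = cooccurrence([[1, 0],
                  [1, 1]])
```

Here `C[0][1] = 1` (the terms share one document), while each diagonal entry is the term's squared weight mass.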
Main idea: lexicalized subtrees
Aim: to have each dimension of the vector space encode a word together with its position within the XML tree
How: Map XML documents to lexicalized subtrees ("with words")
[Figure: an XML tree Book → (Title: "Microsoft", Author: "Bill Gates"), and the lexicalized subtrees it yields, e.g. Microsoft, Bill, Gates, Title–Microsoft, Author–Bill, Author–Gates, Book–Title–Microsoft, …]
85
Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
  |cq| and |cd| are the number of nodes in the query path and document path, respectively
  cq matches cd iff we can transform cq into cd by inserting additional nodes
86
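The CR function itself (IIR 10.3) is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise. A minimal Python sketch, representing a path as a list of node labels ending in the word (this representation is an assumption of the illustration):

```python
def matches(cq, cd):
    """True iff cq can be transformed into cd by inserting nodes,
    i.e. cq is a subsequence of cd."""
    it = iter(cd)
    return all(node in it for node in cq)  # consumes 'it' left to right

def cr(cq, cd):
    """Context resemblance: (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0
```

This reproduces the next slide's numbers: author/"Bill" vs book/author/firstname/"Bill" gives (1+2)/(1+4) = 0.6, while author/"Bill" vs title/"Bill" gives 0.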
SimNoMerge example
Query: <c1, t>, e.g. author/"Bill"
Inverted index (dictionary → postings):
  <c1, t> (e.g. author/"Bill")                → <d1, 0.5>, <d4, 0.1>, <d9, 0.2>
  <c2, t> (e.g. title/"Bill")                 → <d2, 0.25>, <d3, 0.1>, <d12, 0.9>
  <c3, t> (e.g. book/author/firstname/"Bill") → <d3, 0.7>, <d6, 0.8>, <d9, 0.5>
Context resemblance: CR(c1, c1) = 1.0, CR(c1, c2) = 0.0, CR(c1, c3) = 0.6
All weights have been normalized
OK to ignore Query vs <c2, t> since CR = 0.0
Combining context resemblance × query term weight × document term weight:
  if wq = 1.0 then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(This example is slightly different from the book)
87
INEX relevance assessments
The relevance-coverage combinations are quantized as follows:
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel, cov) over the components in A.
88
Week 11: Probabilistic IR
Chapter 11
  1. Probabilistic Approach to Retrieval; Basic Probability Theory
  2. Probability Ranking Principle
  3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12
  1. Language Models for IR
89
Ch 11-12
Binary Independence Model (BIM)
Traditionally used with the PRP
Assumptions:
  "Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors
    E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise
    Different documents may have the same vector representation
  "Independence": no association between terms (not true, but works in practice – a naïve assumption)
90
Sec 11.3
Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, ct:

  ct = log [ pt(1 − ut) / (ut(1 − pt)) ]

The odds ratio is the ratio of two odds:
  (i) the odds of the term appearing if relevant, pt / (1 − pt), and
  (ii) the odds of the term appearing if non-relevant, ut / (1 − ut)
ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents
ct functions as a term weight, so operationally we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM
91
Sec 11.3.1
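The ct weight and the accumulator-style RSV sum can be sketched as follows (a toy illustration; the p/u dictionaries are assumed inputs, and log base 10 is a choice of this sketch, since any base preserves the ranking):

```python
import math

def c_t(p_t, u_t):
    """Log odds ratio term weight: log [ p_t(1-u_t) / (u_t(1-p_t)) ]."""
    return math.log10((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

def rsv(query_terms, doc_terms, p, u):
    """Retrieval status value: sum c_t over query terms present in the doc,
    as in the accumulator loop of VSM scoring."""
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)
```

With equal odds (p_t = u_t) the weight is 0; a term more likely in relevant documents gets a positive weight.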
Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
  tf(t,q): term frequency in the query q
  k3: tuning parameter controlling term frequency scaling of the query
  No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75
92
Sec 11.4.3
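A minimal sketch of BM25 scoring with the long-query (k3) factor described above; function name, the tf-dict inputs, and the log10 idf are assumptions of this illustration, and the parameter defaults follow the slide's suggested ranges:

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
               k1=1.2, b=0.75, k3=1.5):
    """BM25 score of one document for one query.
    query_tf/doc_tf map term -> count; df maps term -> document frequency."""
    score = 0.0
    for t, tf_tq in query_tf.items():
        tf_td = doc_tf.get(t, 0)
        if tf_td == 0:
            continue
        idf = math.log10(N / df[t])
        # document-side tf saturation with length normalization (k1, b)
        doc_part = ((k1 + 1) * tf_td) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        # query-side tf saturation (k3); no query length normalization
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score
```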
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great
  In either case, you build an information retrieval scheme in the exact same way
  Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory
93
Sec 11.4.1
Using language models in IR
Each document is treated as (the basis for) a language model
Given a query q, rank documents based on P(d|q) ∝ P(q|d) P(d):
  P(q) is the same for all documents, so ignore it
  P(d) is the prior – often treated as the same for all d
    But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)
  P(q|d) is the probability of q given d
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 12.2
Mixture model: Summary
What we model: The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 12.2.2
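A minimal sketch of query-likelihood scoring with this kind of mixture: each query term's probability is a linear interpolation of the document model and the collection model (λ and the token-list representation are assumptions of this illustration):

```python
def query_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) under a mixture of the document LM and the collection LM:
    P(q|d) = prod over query terms of [ lam*P(t|M_d) + (1-lam)*P(t|M_c) ].
    query/doc/collection are lists of tokens."""
    p = 1.0
    dlen, clen = len(doc), len(collection)
    for t in query:
        p_d = doc.count(t) / dlen          # document model estimate
        p_c = collection.count(t) / clen   # collection model (smoothing)
        p *= lam * p_d + (1 - lam) * p_c
    return p
```

The collection term keeps the score nonzero when a query term is missing from the document, so a long query does not zero out every document.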
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
96
Ch 20-21
Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection
Information Retrieval 97
Sec 20.1
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
Example:
  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:
Important: cache the robots.txt file of each site we are crawling
Information Retrieval 98
Sec 20.2
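Python's standard library can parse robots.txt rules like the example above; a small sketch using `urllib.robotparser` (the host and agent names are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The slide's example: everyone is barred from /yoursite/temp/,
# but the crawler named "searchengine" may fetch anything.
robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
```

`rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html")` is allowed, while any other agent is refused for that path; a polite crawler checks this (from its cached copy) before every fetch.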
Pagerank algorithm
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable
Information Retrieval 99
Sec 21.2.2
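The four steps above are power iteration; a minimal pure-Python sketch (the two-state transition matrix is an illustrative example, not from the slides):

```python
def power_iteration(A, steps=50):
    """Repeatedly multiply an initial distribution x by the row-stochastic
    transition matrix A; for large step counts x approximates the steady
    state a."""
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)   # start with any distribution, e.g. (1, 0, ...)
    for _ in range(steps):
        x = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
    return x

# Two-state chain: from state 0 go to state 1 with prob 0.75, etc.
a = power_iteration([[0.25, 0.75],
                     [0.5,  0.5]])
```

For this matrix the steady state is (0.4, 0.6): you can verify 0.4·0.25 + 0.6·0.5 = 0.4 and 0.4·0.75 + 0.6·0.5 = 0.6, so the distribution no longer changes.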
Pagerank summary
Pre-processing:
  Given a graph of links, build matrix A
  From it, compute a
  The pagerank aᵢ is a scaled number between 0 and 1
Query processing:
  Retrieve pages meeting the query
  Rank them by their pagerank
  Order is query-independent
100
Sec 21.2.2
WHERE TO GO FROM HERE
101
Learning Objectives
You have learnt how to build your own search engine
In addition, you have picked up skills that are useful for your future:
  Python – one of the easiest and most straightforward programming languages to use
  NLTK – a good set of routines and data that are useful in dealing with NLP and IR
102
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IR / VSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
104
Opportunities in IR
Thanks for joining us for the journey!
105
CS3245 ndash Information Retrieval
Variable Byte Encoding ExampledocIDs 824 829 215406gaps 5 214577VB code 00000110
10111000 10000101 00001101
00001100 10110001
65
Postings stored as the byte concatenation00000110 10111000 10000101 00001101 00001100 10110001
Key property VB-encoded postings areuniquely prefix-decodable
For a small gap (5) VB uses a whole byte
Sec 53
512+256+32+16+8 = 824
CS3245 ndash Information Retrieval
Week 7 Vector space ranking
1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf
vector3 Compute the cosine similarity score for the query
vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user
66
CS3245 ndash Information Retrieval
tf-idf weighting
The tf-idf weight of a term is the product of its tfweight and its idf weight
Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection67
)df(log)tflog1(w 10 tdt Ndt
times+=
Sec 622
CS3245 ndash Information Retrieval
Queries as vectors Key idea 1 Do the same for queries represent them
as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors asymp inverse of distance asymp cosine similarity
Motivation Want to rank more relevant documents higher than less relevant documents
68
Sec 63
CS3245 ndash Information Retrieval
For length-normalized vectors cosine similarity is simply the dot product (or scalar product)
for length normalized 119902 and 119889119889
Cosine for length-normalized vectors
Information Retrieval 69
CS3245 ndash Information Retrieval
Information Retrieval 70
Sec 64
tf-idf weighting has many variants
CS3245 ndash Information Retrieval
Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to
compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query
term proximity
Resources for today IIR 7 61
71
CS3245 ndash Information Retrieval
Recap Computing cosine scores
72
Sec 633
CS3245 ndash Information Retrieval
Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has
many docs from among the top K Return the top K docs in A
Think of A as pruningnon-contenders
The same approach can also used for other (non-cosine) scoringfunctions
73
Sec 711
NJ
A
K
CS3245 ndash Information Retrieval
Heuristics Index Elimination High-idf query terms only Docs containing many query terms
Champion lists Generalized to high-low lists and tiered index
Impact ordering postings Early termination idf ordered terms
Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(qd) = g(d) + cosine(qd)
Can use some other linear combination than an equal weighting
Indeed any function of the two ldquosignalsrdquo of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value Sometimes build range (B-
tree) trees (eg for dates)
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection
AB Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
00
02
04
06
08
10
00 02 04 06 08 10Recall
Pre
cisi
on
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Week 7 Vector space ranking
1 Represent the query as a weighted tf-idf vector2 Represent each document as a weighted tf-idf
vector3 Compute the cosine similarity score for the query
vector and each document vector4 Rank documents with respect to the query by score5 Return the top K (eg K = 10) to the user
66
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight:

w(t,d) = (1 + log10 tf(t,d)) × log10(N / df(t))

Best known weighting scheme in IR. Note: the "-" in tf-idf is a hyphen, not a minus sign. Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.
67
Sec. 6.2.2
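The weight above can be sketched directly in Python (the helper name is illustrative, not from the slides):

```python
import math

def tf_idf_weight(tf_td, df_t, N):
    """w(t,d) = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    if tf_td == 0:
        return 0.0
    return (1 + math.log10(tf_td)) * math.log10(N / df_t)

# A term occurring 10 times in a doc, appearing in 100 of 10,000 docs:
print(tf_idf_weight(10, 100, 10_000))  # → 4.0  ((1 + 1) * 2)
```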
Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space; they are "mini-documents".
Key idea 2: Rank documents according to their proximity to the query in this space.
proximity = similarity of vectors ≈ inverse of distance ≈ cosine similarity
Motivation: want to rank more relevant documents higher than less relevant documents.
68
Sec. 6.3
Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

cos(q, d) = q · d, for length-normalized q and d

Information Retrieval 69
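A minimal sketch of this: once both vectors are unit length, cosine similarity is just the dot product (helper names are illustrative):

```python
import math

def normalize(v):
    """Scale v to unit (L2) length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = normalize([1.0, 1.0, 0.0])
d = normalize([2.0, 0.0, 0.0])
print(round(dot(q, d), 4))  # cosine(q, d) → 0.7071
```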
tf-idf weighting has many variants
[Table of tf-idf weighting variants (SMART notation) not reproduced here]
Information Retrieval 70
Sec. 6.4
Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity

Resources for today: IIR 7, 6.1
71
Recap: Computing cosine scores
72
Sec. 6.3.3
Generic approach
Find a set A of contenders, with K < |A| << N.
A does not necessarily contain the top K, but has many docs from among the top K.
Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.
73
Sec. 7.1.1
Heuristics
Index elimination: high-idf query terms only; docs containing many query terms
Champion lists: generalized to high-low lists and tiered indexes
Impact-ordered postings: early termination; idf-ordered terms
Cluster pruning
74
Sec. 7.1.1
Net score
Consider a simple total score combining cosine relevance and authority:

net-score(q, d) = g(d) + cosine(q, d)

Can use some other linear combination than an equal weighting.
Indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.
75
Sec. 7.1.4
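For instance (toy numbers, assuming g(d) and the cosine are both already scaled to [0, 1]):

```python
def net_score(g_d, cos_qd):
    """Equal-weight combination of static quality g(d) and query-dependent cosine."""
    return g_d + cos_qd

# (g(d), cosine(q, d)) per document:
docs = {"d1": (0.9, 0.3), "d2": (0.2, 0.8), "d3": (0.5, 0.6)}
ranked = sorted(docs, key=lambda d: net_score(*docs[d]), reverse=True)
print(ranked)  # → ['d1', 'd3', 'd2']  (scores 1.2, 1.1, 1.0)
```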
Parametric Indices
Field: Year = 1601 is an example of a field.
Field or parametric index: postings for each field value.
Sometimes build range (B-tree) trees (e.g., for dates).
A field query is typically treated as a conjunction (doc must be authored by shakespeare).
Zone: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, ...
Build inverted indexes on zones as well to permit querying.
76
Sec. 6.1
Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; Precision and Recall; Composite measures; Test collections; A/B testing
77
Ch. 8
A precision-recall curve
[Figure: precision (y-axis, 0.0-1.0) plotted against recall (x-axis, 0.0-1.0)]
78
Sec. 8.4
Kappa measure for inter-judge (dis)agreement
Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.

Kappa (K) = [P(A) - P(E)] / [1 - P(E)]

P(A) - proportion of the time judges agree
P(E) - what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.
Information Retrieval 79
Sec. 8.5
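A one-line sketch of the formula:

```python
def kappa(p_agree, p_chance):
    """Kappa = [P(A) - P(E)] / [1 - P(E)]."""
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(0.8, 0.5), 3))  # judges agree 80%, chance is 50% → 0.6
print(kappa(0.5, 0.5))            # pure chance agreement → 0.0
print(kappa(1.0, 0.5))            # total agreement → 1.0
```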
A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80
Sec. 8.6.3
Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10: XML IR - basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR
Chapter 9: Query Refinement
1. Relevance Feedback (document level): explicit RF - Rocchio (1971); when does it work?; variants: implicit and blind
2. Query Expansion (term level): manual thesaurus; automatic thesaurus generation
81
Rocchio (1971)
In practice:

q_m = α q_0 + β (1/|D_r|) Σ_{d in D_r} d − γ (1/|D_nr|) Σ_{d in D_nr} d

D_r = set of known relevant doc vectors
D_nr = set of known irrelevant doc vectors
These differ from C_r and C_nr, as we only get judgments from a few documents.
α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
82
Sec. 9.2.2
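A sketch of the update on dense vectors (clamping negative weights to 0, as is commonly done in practice; names are illustrative):

```python
def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q + beta*centroid(Dr) - gamma*centroid(Dnr), clamped at 0."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(q)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(q))]
    cr, cnr = centroid(rel), centroid(nonrel)
    return [max(0.0, alpha * q[i] + beta * cr[i] - gamma * cnr[i])
            for i in range(len(q))]

q = [1.0, 0.0]  # original query vector
print(rocchio(q, rel=[[0.0, 1.0]], nonrel=[[1.0, 0.0]]))  # → [0.85, 0.75]
```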
Query Expansion
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good/bad search term) on words or phrases.
83
Sec. 9.2.2
Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AA^T, where A is the M × N term-document matrix.
w_ij = (normalized) weight for (t_i, d_j)
For each t_i, pick terms with high values in C.
In NLTK! Did you forget?
84
Sec. 9.2.3
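A sketch of C = AA^T with NumPy (toy matrix; term rows are length-normalized before the product so entries are cosines):

```python
import numpy as np

# Term-document matrix A (3 terms x 4 docs):
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
A = A / np.linalg.norm(A, axis=1, keepdims=True)  # normalize term rows

C = A @ A.T                  # C[i, j] = similarity of terms i and j
np.fill_diagonal(C, 0.0)     # ignore self-similarity
print(C.argmax(axis=1))      # nearest "thesaurus" term for each term → [1 0 1]
```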
Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees ("with words").
[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into its lexicalized subtrees, e.g. Title→Microsoft, Author→Bill, Author→Gates, Book→Title→Microsoft, ...]
85
Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:

CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise

|c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.
86
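A sketch of CR, treating each path as a list of node labels; the subsequence check implements "transform c_q into c_d by inserting additional nodes":

```python
def context_resemblance(cq, cd):
    """CR = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    it = iter(cd)
    matches = all(node in it for node in cq)  # cq is a subsequence of cd?
    return (1 + len(cq)) / (1 + len(cd)) if matches else 0.0

print(context_resemblance(["author", "Bill"],
                          ["book", "author", "firstname", "Bill"]))  # → 0.6
print(context_resemblance(["title", "Bill"],
                          ["book", "author", "firstname", "Bill"]))  # → 0.0
```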
SimNoMerge example
Query: <c1, t>, e.g., author/"Bill".
Inverted index (context-term pairs mapped to <doc, weight> postings):
<c1, t> (e.g., author/"Bill"): <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> (e.g., title/"Bill"): <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> (e.g., book/author/firstname/"Bill"): <d3, 0.7> <d6, 0.8> <d9, 0.5>
CR(c1, c1) = 1.0
CR(c1, c2) = 0.0 - OK to ignore query vs. <c2, t> since CR = 0.0
CR(c1, c3) = 0.60
If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is context resemblance × query term weight × document term weight; all weights have been normalized)
This example is slightly different from the book.
87
INEX relevance assessments
The relevance-coverage combinations are quantized as follows: [quantization table not reproduced]
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the components' Q values.
88
Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval; Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR
89
Ch. 11-12
Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, ..., xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice - a naïve assumption).
90
Sec. 11.3
Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query.
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t):

c_t = log [ p_t (1 − u_t) / (u_t (1 − p_t)) ]

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so operationally we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.
91
Sec. 11.3.1
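The log odds ratio above is a one-liner (function name is illustrative):

```python
import math

def c_t(p_t, u_t):
    """Log odds ratio: log[(p/(1-p)) / (u/(1-u))]."""
    return math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

print(c_t(0.5, 0.5))      # equal odds → 0.0
print(c_t(0.8, 0.2) > 0)  # term favours relevant docs → True
print(c_t(0.2, 0.8) < 0)  # term favours non-relevant docs → True
```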
Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:

RSV_d = Σ_{t in q} idf_t × [(k1 + 1) tf_td] / [k1((1 − b) + b(L_d / L_ave)) + tf_td] × [(k3 + 1) tf_tq] / [k3 + tf_tq]

tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.
92
Sec. 11.4.3
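A sketch of one term's BM25 contribution with the query-side (k3) factor included; parameter defaults here are within the range the slide reports, but the exact formula layout is a standard reconstruction, not copied from the slides:

```python
import math

def bm25_term(idf_t, tf_td, tf_tq, L_d, L_ave, k1=1.5, b=0.75, k3=1.5):
    """One query term's contribution to RSV_d."""
    doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * L_d / L_ave) + tf_td)
    qry_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
    return idf_t * doc_part * qry_part

# tf saturates: doubling tf raises the score, but by less than 2x.
s1 = bm25_term(idf_t=2.0, tf_td=1, tf_tq=1, L_d=100, L_ave=100)
s2 = bm25_term(idf_t=2.0, tf_td=2, tf_tq=1, L_d=100, L_ave=100)
print(s1 < s2 < 2 * s1)  # → True
```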
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great.
In either case you build an information retrieval scheme in the exact same way.
Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
93
Sec. 11.4.1
Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q):

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, so ignore it.
P(d) is the prior - often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
94
Sec. 12.2
Mixture model: Summary

P(q|d) ∝ Π_{t in q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document that the user had in mind was in fact this one.
95
Sec. 12.2.2
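A sketch of the mixture in log space (toy counts; λ = 0.5; names are illustrative):

```python
import math

def log_p_q_given_d(query, doc, coll_tf, coll_len, lam=0.5):
    """log P(q|d) under a document/collection mixture model."""
    score = 0.0
    for t in query:
        p_doc = doc.count(t) / len(doc)          # P(t | M_d)
        p_coll = coll_tf.get(t, 0) / coll_len    # P(t | M_c)
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

doc = ["click", "go", "the", "shears", "boys", "click"]
coll_tf = {"click": 4, "shears": 2, "the": 20}
s = log_p_q_given_d(["click", "shears"], doc, coll_tf, coll_len=100)
print(s < 0)  # → True (log of a probability)
```

The collection model keeps the score finite even when a query term is missing from the document, which is the point of smoothing.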
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
96
Ch. 20-21
Crawling
Politeness: need to crawl only the allowed pages and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.
Information Retrieval 97
Sec. 20.1
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.
Information Retrieval 98
Sec. 20.2
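The example rules can be checked with Python's standard-library parser (parsing the rules inline rather than fetching them; the URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

# Generic crawlers fall under '*' and are blocked from /yoursite/temp/:
print(rp.can_fetch("somebot", "http://example.com/yoursite/temp/x.html"))       # False
# 'searchengine' has an empty Disallow, i.e. everything is allowed:
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/x.html"))  # True
```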
Pagerank algorithm
Regardless of where we start, we eventually reach the steady state a.
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
99
Sec. 21.2.2
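The steps above can be sketched with NumPy (a toy 3-page row-stochastic matrix, with teleportation assumed already folded into A):

```python
import numpy as np

# Transition matrix A for a tiny 3-page web (each row sums to 1):
A = np.array([[0.10, 0.80, 0.10],
              [0.45, 0.10, 0.45],
              [0.10, 0.80, 0.10]])

x = np.array([1.0, 0.0, 0.0])  # start with any distribution
for _ in range(50):            # x, xA, xA^2, ... until stable
    x = x @ A
print(np.round(x, 3))          # steady state a → [0.265 0.471 0.265]
```

The same `x` comes out for any starting distribution, which is exactly the "regardless of where we start" claim.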
Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.
100
Sec. 21.2.2
WHERE TO GO FROM HERE
101
Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python - one of the easiest and more straightforward programming languages to use
NLTK - a good set of routines and data that are useful in dealing with NLP and IR
102
[Slide 103: a map of neighbouring fields - Adversarial IR, Boolean IR/VSM, Probabilistic IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, Query Analysis]
Photo credits: http://www.flickr.com/photos/aperezdc
103
Opportunities in IR
104
Thanks for joining us for the journey!
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
tf-idf weighting
The tf-idf weight of a term is the product of its tfweight and its idf weight
Best known weighting scheme IR Note the ldquo-rdquo in tf-idf is a hyphen not a minus sign Alternative names tfidf tf x idf
Increases with the number of occurrences within a document
Increases with the rarity of the term in the collection67
)df(log)tflog1(w 10 tdt Ndt
times+=
Sec 622
CS3245 ndash Information Retrieval
Queries as vectors Key idea 1 Do the same for queries represent them
as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors asymp inverse of distance asymp cosine similarity
Motivation Want to rank more relevant documents higher than less relevant documents
68
Sec 63
CS3245 ndash Information Retrieval
For length-normalized vectors cosine similarity is simply the dot product (or scalar product)
for length normalized 119902 and 119889119889
Cosine for length-normalized vectors
Information Retrieval 69
CS3245 ndash Information Retrieval
Information Retrieval 70
Sec 64
tf-idf weighting has many variants
CS3245 ndash Information Retrieval
Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to
compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query
term proximity
Resources for today IIR 7 61
71
CS3245 ndash Information Retrieval
Recap Computing cosine scores
72
Sec 633
CS3245 ndash Information Retrieval
Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has
many docs from among the top K Return the top K docs in A
Think of A as pruningnon-contenders
The same approach can also used for other (non-cosine) scoringfunctions
73
Sec 711
NJ
A
K
CS3245 ndash Information Retrieval
Heuristics Index Elimination High-idf query terms only Docs containing many query terms
Champion lists Generalized to high-low lists and tiered index
Impact ordering postings Early termination idf ordered terms
Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(qd) = g(d) + cosine(qd)
Can use some other linear combination than an equal weighting
Indeed any function of the two ldquosignalsrdquo of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value Sometimes build range (B-
tree) trees (eg for dates)
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection
AB Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
00
02
04
06
08
10
00 02 04 06 08 10Recall
Pre
cisi
on
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Queries as vectors Key idea 1 Do the same for queries represent them
as vectors in the space they are ldquomini-documentsrdquo Key idea 2 Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors asymp inverse of distance asymp cosine similarity
Motivation Want to rank more relevant documents higher than less relevant documents
68
Sec 63
Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d, for length-normalized q and d
69

tf-idf weighting has many variants
70
Sec 6.4
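To make the reduction concrete, here is a minimal sketch (not course-provided code) of cosine similarity over sparse term-weight dictionaries, showing that after length normalization the cosine is just a dot product:

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse term->weight dicts (illustrative sketch)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d)

def dot(q, d):
    """For length-normalized vectors, cosine reduces to this plain dot product."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())
```

Normalizing each vector once at index time lets the scorer use `dot` alone at query time.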
Week 8: Complete Search Systems
Making the Vector Space Model more efficient to compute
Approximating the actual correct results
Skipping unnecessary documents
In actual data: dealing with zones and fields, query term proximity
Resources for today: IIR 7, 6.1
71
Recap: Computing cosine scores
72
Sec 6.3.3
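The cosine-score recap can be sketched as a term-at-a-time scorer with per-document accumulators, in the spirit of the CosineScore pseudocode from IIR; the postings and length structures below are hypothetical stand-ins for a real index:

```python
from collections import defaultdict

def cosine_scores(query_terms, postings, doc_lengths, K=3):
    """Term-at-a-time cosine scoring (sketch).
    postings: term -> list of (doc_id, tf_idf_weight)
    doc_lengths: doc_id -> document vector length
    Query term weights are taken as 1 for simplicity."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, w_td in postings.get(t, []):
            scores[doc_id] += w_td              # accumulate dot-product contributions
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]   # length-normalize each document score
    return sorted(scores.items(), key=lambda x: -x[1])[:K]
```

The accumulator dict only ever holds documents that contain at least one query term, which is what makes the approach practical on large collections.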
Generic approach
Find a set A of contenders, with K < |A| << N.
A does not necessarily contain the top K, but has many docs from among the top K.
Return the top K docs in A.
Think of A as pruning non-contenders.
The same approach can also be used for other (non-cosine) scoring functions.
73
Sec 7.1.1
Heuristics
Index elimination: high-idf query terms only; docs containing many query terms
Champion lists: generalized to high-low lists and tiered indexes
Impact-ordered postings: early termination; idf-ordered terms
Cluster pruning
74
Sec 7.1.1
Net score
Consider a simple total score combining cosine relevance and authority:
net-score(q, d) = g(d) + cosine(q, d)
Can use some other linear combination than an equal weighting.
Indeed, any function of the two "signals" of user happiness.
Now we seek the top K docs by net score.
75
Sec 7.1.4
Parametric Indices
Fields: "Year = 1601" is an example of a field.
Field (or parametric) index: postings for each field value. Sometimes build range (B-tree) trees (e.g., for dates).
A field query is typically treated as a conjunction (doc must be authored by Shakespeare).
Zones: a zone is a region of the doc that can contain an arbitrary amount of text, e.g., Title, Abstract, References, ...
Build inverted indexes on zones as well to permit querying.
76
Sec 6.1
Week 9: IR Evaluation
How do we know if our results are any good?
Benchmarks; Precision and Recall; Composite measures; Test collections
A/B Testing
77
Ch 8
A precision-recall curve
[Figure: a precision-recall curve; Precision (y-axis, 0.0–1.0) vs. Recall (x-axis, 0.0–1.0)]
78
Sec 8.4
Kappa measure for inter-judge (dis)agreement
Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.
Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
P(A): proportion of the time the judges agree
P(E): what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.
79
Sec 8.5
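The kappa computation can be sketched for two judges making binary relevance judgments, using pooled marginals for the chance-agreement term (a simplifying assumption, not course-provided code):

```python
def kappa(judgments_a, judgments_b):
    """Kappa for two judges' binary (0/1) relevance judgments (sketch)."""
    n = len(judgments_a)
    p_agree = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n  # P(A)
    # Pooled marginal: overall proportion of "relevant" judgments across both judges
    p_rel = (sum(judgments_a) + sum(judgments_b)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2                             # P(E)
    return (p_agree - p_chance) / (1 - p_chance)
```

With identical judgments kappa is 1; when agreement equals what chance predicts, kappa is 0.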
A/B testing
Purpose: test a single innovation.
Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most.
80
Sec 8.6.3
Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR
Basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR
Chapter 9: Query Refinement
1. Relevance Feedback (document level)
   Explicit RF – Rocchio (1971); when does it work?
   Variants: implicit and blind
2. Query Expansion (term level)
   Manual thesaurus
   Automatic thesaurus generation
81
Sec 9.2.2
82
Rocchio (1971)
In practice:
q_m = α q_0 + β (1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ (1/|D_nr|) Σ_{d_j ∈ D_nr} d_j
D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors
(Different from C_r and C_nr, as we only get judgments from a few documents.)
α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
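The Rocchio update can be sketched directly from the definitions above; the α, β, γ defaults below are commonly cited values, assumed here for illustration:

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback (sketch).
    q0: query vector; rel_docs / nonrel_docs: lists of document vectors."""
    q = alpha * q0
    if rel_docs:
        q = q + beta * np.mean(rel_docs, axis=0)       # pull toward relevant centroid
    if nonrel_docs:
        q = q - gamma * np.mean(nonrel_docs, axis=0)   # push away from irrelevant centroid
    return np.maximum(q, 0.0)  # negative term weights are typically clipped to 0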
Query Expansion
In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
In query expansion, users give additional input (good / bad search term) on words or phrases.
Sec 9.2.2
83
Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the term-document matrix (M terms t_i by N documents d_j).
w_ij = (normalized) weight for (t_i, d_j)
For each t_i, pick terms with high values in C.
Sec 9.2.3
In NLTK! Did you forget?
84
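A toy version of C = AAᵀ over a hypothetical three-term, three-document matrix (illustrative data, not from the course):

```python
import numpy as np

terms = ["car", "auto", "banana"]
# Rows = terms, columns = documents; weights are made-up tf-idf-like values.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
A = A / np.linalg.norm(A, axis=1, keepdims=True)  # length-normalize each term row
C = A @ A.T                                       # term-term similarity matrix C = A A^T

def neighbor(i):
    """Most similar other term to terms[i] according to C."""
    sims = C[i].copy()
    sims[i] = -1.0  # exclude the term itself
    return terms[int(np.argmax(sims))]
```

Here `neighbor(0)` picks "auto" for "car", since they co-occur in document 1 while "banana" shares no documents with either.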
Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.
[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into its lexicalized subtrees "with words", e.g., Title→Microsoft, Author→Bill, Author→Gates, Book/Title→Microsoft, ...]
85
Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:
CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise
|c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.
86
SimNoMerge example
Query: <c1, t>. Inverted index (dictionary → postings):
<c1, t> → <d1, 0.5> <d4, 0.1> <d9, 0.2>
<c2, t> → <d2, 0.25> <d3, 0.1> <d12, 0.9>
<c3, t> → <d3, 0.7> <d6, 0.8> <d9, 0.5>
(This example is slightly different from the book. Contexts are paths such as author/"Bill", title/"Bill", book/author/firstname/"Bill".)
CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60
If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)
It is OK to ignore Query vs. <c2, t> since CR = 0.0; compute Query vs. <c1, t> and Query vs. <c3, t>.
87
INEX relevance assessments
The relevance-coverage combinations are quantized as follows:
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of the components' Q values.
88
Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval; basic probability theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12
1. Language Models for IR
Ch 11-12
89
Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
"Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d represented by vector x = (x_1, ..., x_M), where x_t = 1 if term t occurs in d and x_t = 0 otherwise. Different documents may have the same vector representation.
"Independence": no association between terms (not true, but works in practice; a naïve assumption).
Sec 11.3
90
Deriving a Ranking Function for Query Terms
So it boils down to computing RSV. We can rank documents using the log odds ratios c_t for the terms in the query.
The odds ratio is the ratio of two odds:
(i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
(ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).
c_t = log [ p_t (1 − u_t) / (u_t (1 − p_t)) ]
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents; c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t: x_t = q_t = 1} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.
Sec 11.3.1
91
Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.
Sec 11.4.3
92
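The document-side BM25 score can be sketched as follows; the toy inputs and the simple idf variant are assumptions for illustration, not the exact variant examined in the course:

```python
import math

def bm25_score(query, doc, df, N, avg_dl, k1=1.5, b=0.75):
    """Okapi BM25, document part only (sketch).
    query, doc: lists of terms; df: term -> document frequency;
    N: number of documents; avg_dl: average document length."""
    dl = len(doc)
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        if tf == 0 or t not in df:
            continue
        idf = math.log(N / df[t])                    # simple idf variant (assumed)
        denom = tf + k1 * (1 - b + b * dl / avg_dl)  # saturating, length-normalized tf
        score += idf * (tf * (k1 + 1)) / denom
    return score
```

The k1 term caps the benefit of repeated occurrences, and b controls how strongly document length is penalized.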
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Sec 11.4.1
93
Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q):
P(q) is the same for all documents, so ignore it.
P(d) is the prior; often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
Sec 12.2
94
Mixture model: Summary
P(q|d) ∝ Π_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )
What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document that the user had in mind was in fact this one.
Sec 12.2.2
95
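A query-likelihood scorer with mixture (Jelinek-Mercer style) smoothing can be sketched directly from maximum-likelihood estimates; λ = 0.5 is an assumed illustrative value:

```python
def mixture_score(query, doc, collection, lam=0.5):
    """Query likelihood with linear mixture smoothing (sketch).
    doc and collection are lists of tokens; probabilities are MLEs."""
    score = 1.0
    for t in query:
        p_doc = doc.count(t) / len(doc)                # P(t | M_d)
        p_col = collection.count(t) / len(collection)  # P(t | M_c)
        score *= lam * p_doc + (1 - lam) * p_col
    return score
```

Mixing in the collection model keeps the score nonzero for query terms absent from the document, which is the whole point of smoothing.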
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
Ch 20-21
96
Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.
Sec 20.1
97
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website; originally from 1994.
Example:
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.
Sec 20.2
98
Pagerank algorithm
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
Sec 21.2.2
99
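The power-iteration steps above can be sketched as follows, assuming teleportation has already been folded into the row-stochastic transition matrix A:

```python
import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    """Power iteration on a row-stochastic transition matrix A (sketch).
    Returns the steady-state distribution a with x A^k -> a."""
    n = A.shape[0]
    x = np.ones(n) / n                 # any starting distribution works
    for _ in range(max_iter):
        x_next = x @ A                 # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            break                      # product looks stable: we have reached a
        x = x_next
    return x_next
```

With teleportation included, A has a unique principal eigenvector and the iteration converges regardless of the starting distribution.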
Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.
Sec 21.2.2
100
WHERE TO GO FROM HERE
101
Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python: one of the easiest and most straightforward programming languages to use
NLTK: a good set of routines and data that are useful in dealing with NLP and IR
102
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
Opportunities in IR
104

Thanks for joining us for the journey!
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
For length-normalized vectors cosine similarity is simply the dot product (or scalar product)
for length normalized 119902 and 119889119889
Cosine for length-normalized vectors
Information Retrieval 69
CS3245 ndash Information Retrieval
Information Retrieval 70
Sec 64
tf-idf weighting has many variants
CS3245 ndash Information Retrieval
Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to
compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query
term proximity
Resources for today IIR 7 61
71
CS3245 ndash Information Retrieval
Recap Computing cosine scores
72
Sec 633
CS3245 ndash Information Retrieval
Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has
many docs from among the top K Return the top K docs in A
Think of A as pruningnon-contenders
The same approach can also used for other (non-cosine) scoringfunctions
73
Sec 711
NJ
A
K
CS3245 ndash Information Retrieval
Heuristics Index Elimination High-idf query terms only Docs containing many query terms
Champion lists Generalized to high-low lists and tiered index
Impact ordering postings Early termination idf ordered terms
Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(qd) = g(d) + cosine(qd)
Can use some other linear combination than an equal weighting
Indeed any function of the two ldquosignalsrdquo of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value Sometimes build range (B-
tree) trees (eg for dates)
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection
AB Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
00
02
04
06
08
10
00 02 04 06 08 10Recall
Pre
cisi
on
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Information Retrieval 70
Sec 64
tf-idf weighting has many variants
CS3245 ndash Information Retrieval
Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to
compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query
term proximity
Resources for today IIR 7 61
71
CS3245 ndash Information Retrieval
Recap Computing cosine scores
72
Sec 633
CS3245 ndash Information Retrieval
Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has
many docs from among the top K Return the top K docs in A
Think of A as pruningnon-contenders
The same approach can also used for other (non-cosine) scoringfunctions
73
Sec 711
NJ
A
K
CS3245 ndash Information Retrieval
Heuristics Index Elimination High-idf query terms only Docs containing many query terms
Champion lists Generalized to high-low lists and tiered index
Impact ordering postings Early termination idf ordered terms
Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(qd) = g(d) + cosine(qd)
Can use some other linear combination than an equal weighting
Indeed any function of the two ldquosignalsrdquo of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value Sometimes build range (B-
tree) trees (eg for dates)
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection
AB Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
[Figure: sawtooth precision-recall curve; precision on the y-axis vs. recall on the x-axis, both ranging from 0.0 to 1.0]
Sec 8.4
Kappa measure for inter-judge (dis)agreement
Sec 8.5
Kappa measure
 Agreement measure among judges; designed for categorical judgments; corrects for chance agreement
 Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
  P(A): proportion of the time judges agree
  P(E): what agreement would be by chance
 Gives 0 for chance agreement, 1 for total agreement
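A quick sketch of the computation; the numbers reproduce the worked example from IIR Section 8.5 (two judges agree on 370 of 400 documents, with chance agreement 0.8775 from the marginals):

```python
def kappa(p_agree, p_chance):
    """Kappa = [P(A) - P(E)] / [1 - P(E)]."""
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa(0.925, 0.8775), 3))  # → 0.388
```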
A/B testing
 Purpose: test a single innovation
 Prerequisite: you have a large system with many users
 Have most users use the old system
 Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
 Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result
 Now we can directly see if the innovation improves user happiness
 Probably the evaluation methodology that large search engines trust most
Sec 8.6.3
Week 10: Relevance Feedback, Query Expansion, and XML IR
 Chapter 9: Query Refinement
  1. Relevance Feedback (document level)
     Explicit RF: Rocchio (1971); when does it work?
     Variants: implicit and blind
  2. Query Expansion (term level)
     Manual thesaurus
     Automatic thesaurus generation
 Chapter 10.3: XML IR
  Basic XML concepts; challenges in XML IR; vector space model for XML IR; evaluation of XML IR
Sec 9.2.2
Rocchio (1971)
 In practice:
 q_m = α·q_0 + β·(1/|D_r|) Σ_{d_j ∈ D_r} d_j − γ·(1/|D_nr|) Σ_{d_j ∈ D_nr} d_j
 D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors
  Different from C_r and C_nr, as we only get judgments from a few documents
 α, β, γ = weights (hand-chosen or set empirically)
 Popularized in the SMART system (Salton)
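A sketch of the update with NumPy; the alpha/beta/gamma values are common defaults, not from the slide, and the tiny 3-term vectors are made up:

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query toward the centroid of known
    relevant docs and away from the centroid of known irrelevant docs."""
    q = alpha * q0
    if len(rel_docs):
        q = q + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        q = q - gamma * np.mean(nonrel_docs, axis=0)
    return np.maximum(q, 0)  # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.0])
rel = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
nonrel = np.array([[0.0, 0.0, 1.0]])
print(rocchio(q0, rel, nonrel))
```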
Query Expansion
 In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents
 In query expansion, users give additional input (good/bad search term) on words or phrases
Sec 9.2.2
Co-occurrence Thesaurus
 Simplest way to compute one is based on term-term similarities in C = A·Aᵀ, where A is the M × N term-document matrix
 w_ij = (normalized) weight for (t_i, d_j)
 For each t_i, pick terms with high values in C
 [Figure: M × N matrix A, rows t_i, columns d_j]
 In NLTK! Did you forget?
Sec 9.2.3
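A minimal sketch with a toy 3 × 4 term-document matrix (the terms and weights are hypothetical):

```python
import numpy as np

# Toy M x N term-document matrix A (3 terms, 4 docs).
A = np.array([[0.5, 0.0, 0.7, 0.0],
              [0.4, 0.0, 0.6, 0.0],
              [0.0, 0.9, 0.0, 0.8]])
terms = ["car", "auto", "flower"]

C = A @ A.T  # term-term similarity matrix C = A * A^T
i = terms.index("car")
# For t_i = "car", pick the other term with the highest value in row i.
j = max((k for k in range(len(terms)) if k != i), key=lambda k: C[i, k])
print(terms[j])  # → auto
```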
Main idea: lexicalized subtrees
 Aim to have each dimension of the vector space encode a word together with its position within the XML tree
 How? Map XML documents to lexicalized subtrees
 [Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into its "with words" subtrees, e.g. Microsoft, Bill, Gates, Title/Microsoft, Author/Bill, Author/Gates, Book/Title/Microsoft, …]
Context resemblance
 A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:
 CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise
 |c_q| and |c_d| are the number of nodes in the query path and document path, respectively
 c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes
SimNoMerge example
 (This example is slightly different from the book's)
 Query: <c1, t>, e.g., author/"Bill"
 Inverted index (each posting is <document, term weight>; all weights have been normalized):
  <c1, t> (e.g., author/"Bill"): <d1, 0.5> <d4, 0.1> <d9, 0.2>
  <c2, t> (e.g., title/"Bill"): <d2, 0.25> <d3, 0.1> <d12, 0.9>
  <c3, t> (e.g., book/author/firstname/"Bill"): <d3, 0.7> <d6, 0.8> <d9, 0.5>
 CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.60
 OK to ignore query vs. <c2, t> since CR = 0.0
 If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
 (each product: Context Resemblance × Query Term Weight × Document Term Weight)
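A sketch that reproduces the example's numbers; representing paths as lists of node labels is an assumption of this sketch:

```python
def cr(cq, cd):
    """Context resemblance: (1+|cq|)/(1+|cd|) if cq can be turned into cd
    by inserting nodes (i.e., cq is an in-order subsequence of cd), else 0."""
    it = iter(cd)
    if all(node in it for node in cq):  # consumes the iterator in order
        return (1 + len(cq)) / (1 + len(cd))
    return 0.0

c1 = ["author", "Bill"]
c3 = ["book", "author", "firstname", "Bill"]
print(cr(c1, c3))  # → 0.6

# SimNoMerge score of d9 for query <c1, t> with w_q = 1.0:
# d9's weight is 0.2 under c1 and 0.5 under c3 (c2 contributes CR = 0).
score_d9 = cr(c1, c1) * 1.0 * 0.2 + cr(c1, c3) * 1.0 * 0.5
print(round(score_d9, 2))  # → 0.5
```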
INEX relevance assessments
 The relevance-coverage combinations are quantized by a function Q
 This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant
 The number of relevant components in a retrieved set A of components can then be computed as Σ_{c ∈ A} Q(rel(c), cov(c))
Week 11: Probabilistic IR
 Chapter 11: 1. Probabilistic Approach to Retrieval; Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model; BestMatch25 (Okapi)
 Chapter 12: 1. Language Models for IR
Ch 11-12
Binary Independence Model (BIM)
 Traditionally used with the PRP
 Assumptions:
  "Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors
   E.g., document d represented by vector x = (x_1, …, x_M), where x_t = 1 if term t occurs in d and x_t = 0 otherwise
   Different documents may have the same vector representation
  "Independence": no association between terms (not true, but works in practice; a naïve assumption)
Sec 11.3
Deriving a Ranking Function for Query Terms
 So it boils down to computing RSV: we can rank documents using the log odds ratios c_t for the terms in the query:
 c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]
 The odds ratio is the ratio of two odds:
  (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and
  (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t)
 c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents; c_t is positive if it is more likely to appear in relevant documents
 c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q ∩ d} c_t
 Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM
Sec 11.3.1
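The weight is easy to sanity-check directly; the probabilities below are illustrative:

```python
import math

def c_t(p_t, u_t):
    """Log odds ratio c_t = log[ p_t(1-u_t) / (u_t(1-p_t)) ]."""
    return math.log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

print(c_t(0.5, 0.5))      # → 0.0 (equal odds: the term carries no weight)
print(c_t(0.8, 0.2) > 0)  # → True (more likely in relevant docs)
```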
Okapi BM25: A Nonbinary Model
 If the query is long, we might also use similar weighting for query terms
 tf_tq: term frequency in the query q
 k_3: tuning parameter controlling term frequency scaling of the query
 No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
 The above tuning parameters should be set by optimization on a development test collection
 Experiments have shown reasonable values for k_1 and k_3 to be between 1.2 and 2, with b = 0.75
Sec 11.4.3
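A sketch of the document-side BM25 weight (short-query case, so no k_3 term); all collection statistics here are hypothetical:

```python
import math

def bm25_weight(tf_td, df_t, N, dl, avdl, k1=1.2, b=0.75):
    """BM25 document-term weight: idf, tf saturation (k1),
    and document-length normalization (b)."""
    idf = math.log(N / df_t)
    return idf * (k1 + 1) * tf_td / (k1 * ((1 - b) + b * dl / avdl) + tf_td)

# Term appears 3x in a doc of average length, in 100 of 10,000 docs.
print(round(bm25_weight(tf_td=3, df_t=100, N=10_000, dl=50, avdl=50), 2))  # → 7.24
```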
An Appraisal of Probabilistic Models
 The difference between 'vector space' and 'probabilistic' IR is not that great
 In either case, you build an information retrieval scheme in the exact same way
 Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory
Sec 11.4.1
Using language models in IR
 Each document is treated as (the basis for) a language model
 Given a query q, rank documents based on P(d|q) = P(q|d)·P(d) / P(q)
 P(q) is the same for all documents, so ignore it
 P(d) is the prior, often treated as the same for all d
  But we can give a prior to "high-quality" documents, e.g., those with high static quality score g(d) (cf. Section 7.1.4)
 P(q|d) is the probability of q given d
 So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent
Sec 12.2
Mixture model: Summary
 P(q|d) ∝ Π_{t ∈ q} ( λ·P(t|M_d) + (1 − λ)·P(t|M_c) )
 What we model: the user has a document in mind and generates the query from this document
 The equation represents the probability that the document that the user had in mind was in fact this one
Sec 12.2.2
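A sketch of the mixture scorer with maximum-likelihood term probabilities; the toy document and collection are made up:

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Mixture model: score(d) = prod over query terms of
    lam * P(t|Md) + (1 - lam) * P(t|Mc), using MLE estimates."""
    d, c = Counter(doc), Counter(collection)
    score = 1.0
    for t in query:
        p_md = d[t] / len(doc)          # term prob. in the document model
        p_mc = c[t] / len(collection)   # term prob. in the collection model
        score *= lam * p_md + (1 - lam) * p_mc
    return score

d1 = "click go the shears boys click click click".split()
q = ["click", "shears"]
collection = d1 + "metal here".split()
print(query_likelihood(q, d1, collection))
```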
Week 12: Web Search
 Chapter 20: Crawling
 Chapter 21: Link Analysis
Ch 20-21
Crawling
 Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
 Duplicates: need to integrate duplicate detection
 Scale: need to use a distributed approach
 Spam and spider traps: need to integrate spam detection
Sec 20.1
Robots.txt
 Protocol for giving crawlers ("robots") limited access to a website; originally from 1994
 Example:
  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:
 (The first record bars all crawlers from /yoursite/temp/; the empty Disallow lets "searchengine" crawl everything)
 Important: cache the robots.txt file of each site we are crawling
Sec 20.2
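Python's standard library can check such rules; a small sketch using urllib.robotparser (the example.com URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

# The default record blocks everyone from /yoursite/temp/ ...
print(rp.can_fetch("SomeBot", "http://example.com/yoursite/temp/x.html"))      # → False
# ... but "searchengine" has its own record allowing everything.
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/x.html"))  # → True
```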
Pagerank algorithm
 Regardless of where we start, we eventually reach the steady state a:
 1. Start with any distribution (say x = (1, 0, …, 0))
 2. After one step, we're at xA
 3. After two steps, at xA², then xA³, and so on
 4. "Eventually" means: for large k, xA^k = a
 Algorithm: multiply x by increasing powers of A until the product looks stable
Sec 21.2.2
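The steps above can be sketched as power iteration; the 2-state transition matrix is a toy example (assumed stochastic, with teleporting already folded in):

```python
import numpy as np

def power_iterate(A, tol=1e-10):
    """Multiply x by increasing powers of the transition matrix A
    until the product looks stable (steady state a)."""
    x = np.zeros(A.shape[0])
    x[0] = 1.0  # start with distribution (1, 0, ..., 0)
    while True:
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next

A = np.array([[0.25, 0.75],
              [0.50, 0.50]])
print(power_iterate(A).round(2))  # → [0.4 0.6]
```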
Pagerank summary
 Pre-processing: given the graph of links, build matrix A; from it, compute a
  The pagerank a_i is a scaled number between 0 and 1
 Query processing: retrieve pages meeting the query; rank them by their pagerank
  Order is query-independent
Sec 21.2.2
WHERE TO GO FROM HERE
Learning Objectives
 You have learnt how to build your own search engine
 In addition, you have picked up skills that are useful for your future:
  Python: one of the easiest and most straightforward programming languages to use
  NLTK: a good set of routines and data that are useful for dealing with NLP and IR

Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
Opportunities in IR
Thanks for joining us for the journey
CS3245 ndash Information Retrieval
Week 8 Complete Search SystemsMaking the Vector Space Model more efficient to
compute Approximating the actual correct results Skipping unnecessary documentsIn actual data dealing with zones and fields query
term proximity
Resources for today IIR 7 61
71
CS3245 ndash Information Retrieval
Recap Computing cosine scores
72
Sec 633
CS3245 ndash Information Retrieval
Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has
many docs from among the top K Return the top K docs in A
Think of A as pruningnon-contenders
The same approach can also used for other (non-cosine) scoringfunctions
73
Sec 711
NJ
A
K
CS3245 ndash Information Retrieval
Heuristics Index Elimination High-idf query terms only Docs containing many query terms
Champion lists Generalized to high-low lists and tiered index
Impact ordering postings Early termination idf ordered terms
Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(qd) = g(d) + cosine(qd)
Can use some other linear combination than an equal weighting
Indeed any function of the two ldquosignalsrdquo of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value Sometimes build range (B-
tree) trees (eg for dates)
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection
AB Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
00
02
04
06
08
10
00 02 04 06 08 10Recall
Pre
cisi
on
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Recap Computing cosine scores
72
Sec 633
CS3245 ndash Information Retrieval
Generic approach Find a set A of contenders with K lt |A| ltlt N A does not necessarily contain the top K but has
many docs from among the top K Return the top K docs in A
Think of A as pruningnon-contenders
The same approach can also used for other (non-cosine) scoringfunctions
73
Sec 711
NJ
A
K
CS3245 ndash Information Retrieval
Heuristics Index Elimination High-idf query terms only Docs containing many query terms
Champion lists Generalized to high-low lists and tiered index
Impact ordering postings Early termination idf ordered terms
Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(qd) = g(d) + cosine(qd)
Can use some other linear combination than an equal weighting
Indeed any function of the two ldquosignalsrdquo of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value Sometimes build range (B-
tree) trees (eg for dates)
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection
AB Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
00
02
04
06
08
10
00 02 04 06 08 10Recall
Pre
cisi
on
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
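Interpreting "insert additional nodes" as a subsequence test, CR can be sketched as follows (the two paths are the slide's own examples):

```python
def cr(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0.
    cq matches cd iff cq can be turned into cd by inserting nodes,
    i.e. cq is a subsequence of cd."""
    it = iter(cd)
    matches = all(node in it for node in cq)  # consumes `it`: subsequence test
    return (1 + len(cq)) / (1 + len(cd)) if matches else 0.0

c1 = ["author", "Bill"]                       # query path author/"Bill"
c3 = ["book", "author", "firstname", "Bill"]  # doc path book/author/firstname/"Bill"
print(cr(c1, c1))  # 1.0
print(cr(c1, c3))  # 0.6, matching CR(c1, c3) on the next slide
```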
CS3245 ndash Information Retrieval
SimNoMerge example
87
Query: <c1, t>, e.g. author/"Bill".
Inverted index dictionary: <c1, t> (e.g. author/"Bill"), <c2, t> (e.g. title/"Bill"), and <c3, t> (e.g. book/author/firstname/"Bill"), each pointing to a postings list of <docID, weight> entries; all weights have been normalized. For instance, d9 carries weight 0.2 in the <c1, t> postings and 0.5 in the <c3, t> postings. (This example is slightly different from the book's.)
CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.6
Each match contributes Context Resemblance × Query Term Weight × Document Term Weight. It is OK to ignore Query vs. <c2, t>, since CR = 0.0; only Query vs. <c1, t> and Query vs. <c3, t> contribute.
If w_q = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5.
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval / Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12: 1. Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM): traditionally used with the PRP.
Assumptions:
Binary (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g. document d is represented by vector x = (x_1, ..., x_M), where x_t = 1 if term t occurs in d and x_t = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption).
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing the RSV. We can rank documents using the log odds ratios for the terms in the query, c_t:
c_t = log [ p_t / (1 − p_t) ] − log [ u_t / (1 − u_t) ]
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that operationally we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
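The log odds ratio weight and the resulting RSV sum can be sketched directly (the probabilities below are illustrative toy values; in practice p_t and u_t must be estimated):

```python
import math

def c_t(p_t, u_t):
    """Log odds ratio weight: c_t = log[p_t/(1-p_t)] - log[u_t/(1-u_t)]."""
    return math.log(p_t / (1 - p_t)) - math.log(u_t / (1 - u_t))

def rsv(query_terms, doc_terms, p, u):
    """Retrieval Status Value: sum of c_t over query terms present in the doc."""
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)

print(c_t(0.5, 0.5))      # 0.0: equal odds in relevant and non-relevant docs
print(c_t(0.8, 0.2) > 0)  # True: term more likely in relevant docs
```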
CS3245 ndash Information Retrieval
Okapi BM25: a nonbinary model. If the query is long, we might also use similar weighting for query terms:
RSV_d = Σ_{t ∈ q} idf_t · [ (k1 + 1) tf_td ] / [ k1 ((1 − b) + b (L_d / L_ave)) + tf_td ] · [ (k3 + 1) tf_tq ] / [ k3 + tf_tq ]
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.
Sec 1143
92
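The full BM25 score with query-term weighting can be sketched as below (all statistics are passed in as hypothetical toy values; the parameter defaults follow the ranges suggested on the slide):

```python
def bm25(query, doc_tf, doc_len, avg_doc_len, idf, query_tf,
         k1=1.2, b=0.75, k3=1.5):
    score = 0.0
    for t in query:
        if t not in doc_tf:
            continue
        tf_td, tf_tq = doc_tf[t], query_tf.get(t, 1)
        # document side: length-normalized, saturating term frequency
        doc_part = ((k1 + 1) * tf_td) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        # query side: saturating tf, no length normalization
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf[t] * doc_part * query_part
    return score

idf = {"ir": 2.0}
s1 = bm25(["ir"], {"ir": 1}, 100, 100, idf, {"ir": 1})
s2 = bm25(["ir"], {"ir": 5}, 100, 100, idf, {"ir": 1})
print(s1, s2)  # score grows sublinearly with term frequency
```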
CS3245 ndash Information Retrieval
An appraisal of probabilistic models: the difference between 'vector space' and 'probabilistic' IR is not that great. In either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR: each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q).
P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d, but we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model summary:
P(q|d) ∝ ∏_{t ∈ q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )
What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one.
95
Sec 1222
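The mixture can be sketched as a query-likelihood scorer (linear smoothing of the document model with the collection model; λ = 0.5 and the toy counts are invented for illustration):

```python
def p_query(query, doc, collection, lam=0.5):
    """P(q|d) = product over t in q of [lam*P(t|Md) + (1-lam)*P(t|Mc)]."""
    dlen = sum(doc.values())
    clen = sum(collection.values())
    p = 1.0
    for t in query:
        p_md = doc.get(t, 0) / dlen          # document model
        p_mc = collection.get(t, 0) / clen   # collection model (smoothing)
        p *= lam * p_md + (1 - lam) * p_mc
    return p

doc = {"click": 2, "shears": 1}
collection = {"click": 10, "shears": 2, "other": 88}
p1 = p_query(["click", "shears"], doc, collection)
print(p1)  # higher than for a document containing neither query term
```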
CS3245 ndash Information Retrieval
Week 12: Web Search
Chapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling challenges:
Politeness: crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robots.txt: protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.
Information Retrieval 98
Sec 202
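Python's standard library ships a parser for exactly these rules, so a crawler can check permissions before fetching (the rules string mirrors the example above):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: searchengine
Disallow:

User-agent: *
Disallow: /yoursite/temp/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("searchengine", "/yoursite/temp/page.html"))  # True: exempt
print(rp.can_fetch("othercrawler", "/yoursite/temp/page.html"))  # False
print(rp.can_fetch("othercrawler", "/index.html"))               # True
```

In a real crawler, `rp.read()` would fetch the live robots.txt once and the parsed object would be cached per site, as the slide advises.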
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, at xA², then xA³, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
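The power iteration above can be sketched directly (the 2×2 transition matrix A is an invented toy example, assumed row-stochastic with teleporting already folded in):

```python
A = [
    [0.1, 0.9],
    [0.5, 0.5],
]

def step(x, A):
    """One step of the chain: x -> xA (x is a row vector)."""
    n = len(A)
    return [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]

x = [1.0, 0.0]  # start with any distribution
for _ in range(100):
    nxt = step(x, A)
    if max(abs(u - v) for u, v in zip(nxt, x)) < 1e-12:
        break  # product looks stable: steady state reached
    x = nxt

print(x)  # the steady-state distribution a, satisfying a = aA
```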
CS3245 ndash Information Retrieval
100
Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query and rank them by their pagerank. Order is query-independent.
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives: you have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
CS3245 ndash Information Retrieval
Generic approach: find a set A of contenders, with K < |A| ≪ N. A does not necessarily contain the top K, but has many docs from among the top K. Return the top K docs in A. Think of this as pruning non-contenders. The same approach can also be used for other (non-cosine) scoring functions.
73
Sec 7.1.1
CS3245 ndash Information Retrieval
Heuristics:
Index Elimination: high-idf query terms only; docs containing many query terms
Champion lists: generalized to high-low lists and tiered indexes
Impact-ordered postings: early termination; idf-ordered terms
Cluster pruning
74
Sec 711
CS3245 ndash Information Retrieval
Net score Consider a simple total score combining cosine
relevance and authority
net-score(qd) = g(d) + cosine(qd)
Can use some other linear combination than an equal weighting
Indeed, any function of the two "signals" of user happiness
Now we seek the top K docs by net score
75
Sec 714
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value. Sometimes build range structures (B-trees), e.g. for dates.
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9: IR Evaluation. How do we know if our results are any good? Benchmarks; precision and recall; composite measures; test collection; A/B testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
[Figure: precision (y-axis, 0.0–1.0) plotted against recall (x-axis, 0.0–1.0)]
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure: an agreement measure among judges, designed for categorical judgments; corrects for chance agreement.
Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
P(A) = proportion of the time the judges agree; P(E) = what agreement would be by chance.
Gives 0 for chance agreement, 1 for total agreement.
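A sketch that computes κ from two judges' binary relevance labels, estimating P(E) from pooled marginals as in the textbook's worked example (the label lists are invented):

```python
def kappa(judge1, judge2):
    """Kappa = [P(A) - P(E)] / [1 - P(E)] for two judges' 0/1 labels."""
    n = len(judge1)
    p_a = sum(a == b for a, b in zip(judge1, judge2)) / n
    p = (sum(judge1) + sum(judge2)) / (2 * n)  # pooled "relevant" proportion
    p_e = p * p + (1 - p) * (1 - p)            # chance agreement
    return (p_a - p_e) / (1 - p_e)

print(kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0 for total agreement
print(kappa([1, 1, 1, 0], [1, 1, 0, 0]))  # between 0 and 1 for partial agreement
```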
CS3245 ndash Information Retrieval
A/B testing. Purpose: test a single innovation. Prerequisite: you have a large system with many users.
Have most users use the old system; divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation. Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result. Now we can directly see if the innovation improves user happiness. Probably the evaluation methodology that large search engines trust most.
80
Sec 863
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 – Information Retrieval

Net score
Consider a simple total score combining cosine relevance and authority:
  net-score(q, d) = g(d) + cosine(q, d)
Can use some other linear combination than an equal weighting
Indeed, any function of the two "signals" of user happiness
Now we seek the top K docs by net score
75
Sec 7.1.4
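The slide's combination can be sketched in Python (the function names and the candidate-triple layout are illustrative, not from the slides):

```python
def net_score(g_d, cos_qd, alpha=1.0, beta=1.0):
    # alpha = beta = 1 reproduces the slide's equal weighting;
    # any other linear combination of the two signals also works.
    return alpha * g_d + beta * cos_qd

def top_k(candidates, k=10):
    # candidates: (doc_id, g(d), cosine(q, d)) triples
    scored = sorted(candidates, key=lambda c: net_score(c[1], c[2]), reverse=True)
    return [doc_id for doc_id, _, _ in scored[:k]]
```

With an equal weighting, a document with high authority but low cosine relevance can still outrank a more relevant but low-quality one.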
Parametric Indices
Fields
 "Year = 1601" is an example of a field
 Field or parametric index: postings for each field value
 Sometimes build range (B-tree) trees, e.g. for dates
 Field query typically treated as conjunction (doc must be authored by Shakespeare)
Zone
 A zone is a region of the doc that can contain an arbitrary amount of text, e.g. Title, Abstract, References, …
 Build inverted indexes on zones as well to permit querying
76
Sec 6.1
Week 9: IR Evaluation
How do we know if our results are any good?
 Benchmarks
 Precision and Recall
 Composite measures
 Test collection
 A/B Testing
77
Ch 8
A precision-recall curve
[Figure: Precision (y-axis) plotted against Recall (x-axis), both ranging 0.0–1.0]
78
Sec 8.4
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
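The formula above is easy to compute directly; a small sketch for two judges (label values and the function name are illustrative):

```python
def kappa(judge_a, judge_b):
    # Cohen's kappa for two judges' categorical labels (e.g. "R"/"N").
    n = len(judge_a)
    # P(A): observed proportion of agreement
    p_a = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # P(E): chance agreement from each judge's marginal label frequencies
    labels = set(judge_a) | set(judge_b)
    p_e = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    # Note: undefined when P(E) == 1 (both judges always give one identical label)
    return (p_a - p_e) / (1 - p_e)
```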
A/B testing
Purpose: test a single innovation
Prerequisite: you have a large system with many users
 Have most users use the old system
 Divert a small proportion of traffic (e.g. 1%) to the new system that includes the innovation
 Evaluate with an "automatic" overall evaluation criterion (OEC) like clickthrough on first result
Now we can directly see if the innovation does improve user happiness
Probably the evaluation methodology that large search engines trust most
80
Sec 8.6.3
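The slides do not prescribe how to divert the traffic; hash-based user bucketing is one common way to do it, shown here together with a simplified clickthrough-on-first-result OEC (all names and the bucket granularity are illustrative):

```python
import hashlib

def variant(user_id, treatment_fraction=0.01):
    # Deterministic hash-based bucketing: the same user always sees the
    # same system, and ~1% of users land in the "new" bucket by default.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10000
    return "new" if bucket < treatment_fraction * 10000 else "old"

def clickthrough_at_1(clicks):
    # OEC sketch: fraction of queries whose first result was clicked
    # (1 = clicked, 0 = not clicked).
    return sum(clicks) / len(clicks)
```

Comparing `clickthrough_at_1` between the two buckets then directly estimates whether the innovation improves user happiness.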
Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 9: Query Refinement
1. Relevance Feedback (Document Level)
   Explicit RF – Rocchio (1971); when does it work?
   Variants: Implicit and Blind
2. Query Expansion (Term Level)
   Manual thesaurus
   Automatic Thesaurus Generation
Chapter 10.3: XML IR
   Basic XML concepts
   Challenges in XML IR
   Vector space model for XML IR
   Evaluation of XML IR
81
Rocchio (1971)
In practice:
  q⃗m = α·q⃗0 + β·(1/|Dr|)·Σ_{d⃗j ∈ Dr} d⃗j − γ·(1/|Dnr|)·Σ_{d⃗j ∈ Dnr} d⃗j
 Dr = set of known relevant doc vectors
 Dnr = set of known irrelevant doc vectors
  Different from Cr and Cnr, as we only get judgments from a few documents
 α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
82
Sec 9.2.2
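The update rule can be sketched in plain Python; clipping negative weights to zero is standard practice, and the list-based vectors and default α, β, γ values here are illustrative:

```python
def centroid(vectors):
    # Mean of a non-empty list of equal-length vectors.
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    # Move the query toward the centroid of Dr, away from the centroid of Dnr.
    dim = len(q)
    cr = centroid(rel) if rel else [0.0] * dim
    cn = centroid(nonrel) if nonrel else [0.0] * dim
    # Clip negative term weights to zero, as is standard in practice.
    return [max(0.0, alpha * q[i] + beta * cr[i] - gamma * cn[i])
            for i in range(dim)]
```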
Query Expansion
In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents
In query expansion, users give additional input (good / bad search term) on words or phrases
83
Sec 9.2.2
Co-occurrence Thesaurus
Simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix
 wij = (normalized) weight for (ti, dj)
 For each ti, pick terms with high values in C
In NLTK – did you forget?
84
Sec 9.2.3
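A direct, unoptimized sketch of C = AAᵀ over a small term-document matrix (the function name and `top` parameter are illustrative):

```python
def cooccurrence_neighbors(A, terms, top=2):
    # A: M x N term-document weight matrix, one row per term.
    # C = A·Aᵀ gives term-term similarities; for each term t_i we pick
    # the other terms with the highest C[i][j].
    M = len(A)
    C = [[sum(A[i][d] * A[j][d] for d in range(len(A[i]))) for j in range(M)]
         for i in range(M)]
    thesaurus = {}
    for i, t in enumerate(terms):
        ranked = sorted((j for j in range(M) if j != i),
                        key=lambda j: C[i][j], reverse=True)
        thesaurus[t] = [terms[j] for j in ranked[:top]]
    return thesaurus
```

Terms that occur in the same documents with similar weights end up as each other's thesaurus neighbors.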
Main idea: lexicalized subtrees
Aim: have each dimension of the vector space encode a word together with its position within the XML tree
How? Map XML documents to lexicalized subtrees
[Figure: a Book tree with Title "Microsoft" and Author "Bill Gates", decomposed into lexicalized subtrees "with words", e.g. Microsoft; Bill Gates; Title/Microsoft; Author/Bill; Author/Gates; Book/Title/Microsoft]
85
Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
  CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise
|cq| and |cd| are the number of nodes in the query path and document path, respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
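A sketch of CR, treating paths as lists of node labels; "cq matches cd" is implemented as a subsequence test, which is exactly the insert-additional-nodes condition above:

```python
def matches(cq, cd):
    # cq matches cd iff cq can be turned into cd by inserting extra nodes,
    # i.e. cq is a subsequence of cd.  ("x in iterator" advances the iterator,
    # so this checks the labels appear in order.)
    it = iter(cd)
    return all(node in it for node in cq)

def context_resemblance(cq, cd):
    # CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0
```

For example, author/"Bill" against book/author/firstname/"Bill" gives (1 + 2) / (1 + 4) = 0.6, the CR(c1, c3) value used on the next slide.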
SimNoMerge example
Query: <c1, t>, e.g. author/"Bill"
Inverted index (dictionary → postings), e.g. c1 = author/"Bill", c2 = title/"Bill", c3 = book/author/firstname/"Bill":
 <c1, t> → <d1, 0.5> <d4, 0.1> <d9, 0.2>
 <c2, t> → <d2, 0.25> <d3, 0.1> <d12, 0.09>
 <c3, t> → <d3, 0.7> <d6, 0.8> <d9, 0.5>
(This example is slightly different from the book; all weights have been normalized)
CR(c1, c1) = 1.0
CR(c1, c2) = 0.0 – OK to ignore Query vs. <c2, t> since CR = 0.0
CR(c1, c3) = 0.60
If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight, for Query vs. <c1, t> and Query vs. <c3, t>)
87
INEX relevance assessments
The relevance-coverage combinations are quantized as follows: [quantization table omitted]
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as:
  #(relevant items retrieved) = Σ_{c ∈ A} Q(rel(c), cov(c))
88
Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR
Ch 11-12
89
Binary Independence Model (BIM)
Traditionally used with the PRP
Assumptions:
 "Binary" (equivalent to Boolean): documents and queries represented as binary term incidence vectors
  E.g., document d represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise
  Different documents may have the same vector representation
 "Independence": no association between terms (not true, but works in practice – naïve assumption)
90
Sec 11.3
Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV: we can rank documents using the log odds ratios ct for the terms in the query
The odds ratio is the ratio of two odds:
 (i) the odds of the term appearing if relevant, pt / (1 − pt), and
 (ii) the odds of the term appearing if non-relevant, ut / (1 − ut)
  ct = log [ pt(1 − ut) / (ut(1 − pt)) ]
ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents
ct functions as a term weight, so that RSVd = Σ_{t ∈ q∩d} ct
Operationally, we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM
91
Sec 11.3.1
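A minimal sketch of ct and the RSV accumulation, assuming the pt and ut estimates are already given as dictionaries:

```python
import math

def c_t(p_t, u_t):
    # Log odds ratio: odds of the term appearing given relevance,
    # over the odds of it appearing given non-relevance.
    return math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))

def rsv(query_terms, doc_terms, p, u):
    # Accumulate c_t for every query term present in the document,
    # just as VSM accumulates tf-idf contributions.
    return sum(c_t(p[t], u[t]) for t in query_terms if t in doc_terms)
```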
Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
  RSVd = Σ_{t ∈ q} log(N/dft) · [(k1 + 1)·tftd] / [k1((1 − b) + b·(Ld/Lave)) + tftd] · [(k3 + 1)·tftq] / [k3 + tftq]
 tftq: term frequency in the query q
 k3: tuning parameter controlling term frequency scaling of the query
 No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 are between 1.2 and 2, and b = 0.75
92
Sec 11.4.3
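The full score, including the k3 query-term factor, can be sketched as follows (the dictionary-based arguments and default parameter values are illustrative):

```python
import math

def bm25(query_tf, doc_tf, doc_len, avg_len, N, df, k1=1.5, b=0.75, k3=1.5):
    # Okapi BM25 with the long-query extension from the slide.
    # query_tf / doc_tf: term -> frequency; df: term -> document frequency.
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf:
            continue
        tf_td = doc_tf[t]
        idf = math.log(N / df[t])
        doc_part = ((k1 + 1) * tf_td) / (k1 * ((1 - b) + b * doc_len / avg_len) + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)  # no query length normalization
        score += idf * doc_part * query_part
    return score
```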
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great:
 In either case you build an information retrieval scheme in the exact same way
 Difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory
93
Sec 11.4.1
Using language models in IR
Each document is treated as (the basis for) a language model
Given a query q, rank documents based on P(d|q):
  P(d|q) = P(q|d)·P(d) / P(q)
 P(q) is the same for all documents, so ignore
 P(d) is the prior – often treated as the same for all d
  But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4)
 P(q|d) is the probability of q given d
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 12.2
Mixture model: Summary
  P(q|d) ∝ Π_{t ∈ q} ( λ·P(t|Md) + (1 − λ)·P(t|Mc) )
What we model: the user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 12.2.2
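A sketch of the mixture query likelihood with maximum-likelihood document and collection models (the argument layout and λ = 0.5 default are illustrative):

```python
def query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # P(q|d) = product over query terms of  lam*P(t|Md) + (1-lam)*P(t|Mc)
    p = 1.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len     # MLE document model Md
        p_coll = coll_tf.get(t, 0) / coll_len  # MLE collection model Mc
        p *= lam * p_doc + (1 - lam) * p_coll
    return p
```

Mixing in the collection model keeps the score nonzero for query terms the document happens not to contain.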
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
96
Ch 20-21
Crawling
 Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
 Duplicates: need to integrate duplicate detection
 Scale: need to use a distributed approach
 Spam and spider traps: need to integrate spam detection
97
Sec 20.1
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994
Example:
  User-agent: *
  Disallow: /yoursite/temp/

  User-agent: searchengine
  Disallow:
Important: cache the robots.txt file of each site we are crawling
98
Sec 20.2
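Python's standard library ships a parser for this protocol; here it is applied to the slide's example (the example.com URLs and agent name "mybot" are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The slide's example: everyone is barred from /yoursite/temp/,
# except the crawler that identifies itself as "searchengine".
robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())   # in a real crawler, cache this per site

blocked = "http://example.com/yoursite/temp/page.html"
```

`rp.can_fetch("mybot", blocked)` is then False, while `rp.can_fetch("searchengine", blocked)` is True, since an empty Disallow line means "allow everything".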
Pagerank algorithm
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable
99
Sec 21.2.2
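The four steps above can be sketched as a power iteration; this assumes A is given as a row-stochastic list-of-rows matrix for an ergodic chain (i.e. teleportation has already been folded in):

```python
def power_iteration(A, tol=1e-10):
    # Multiply a start distribution x by increasing powers of the
    # transition matrix A until x stops changing (x has converged to a).
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)   # any start distribution works
    while True:
        # one step: x <- xA
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:
            return x_next
        x = x_next
```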
Pagerank summary
Pre-processing:
 Given a graph of links, build matrix A
 From it, compute a
 The pagerank ai is a scaled number between 0 and 1
Query processing:
 Retrieve pages meeting the query
 Rank them by their pagerank
 Order is query-independent
100
Sec 21.2.2
WHERE TO GO FROM HERE
101
Learning Objectives
You have learnt how to build your own search engine
In addition, you have picked up skills that are useful for your future:
 Python – one of the easiest and most straightforward programming languages to use
 NLTK – a good set of routines and data that are useful in dealing with NLP and IR
102
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IR/VSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
Opportunities in IR
104
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs. IR in general
- Without search engines, the web wouldn't work
- Interest aggregation
- SEARCH ADVERTISING
- 1st Generation of Search Ads: Goto (1996)
- 1st Generation of search ads: Goto (1996)
- 2nd generation of search ads: Google (2000)
- Do ads influence editorial content?
- How are ads ranked?
- How are ads ranked?
- Google's second price auction
- Google's second price auction
- Keywords with high bids
- Search ads: A win-win-win
- Not a win-win-win: Keyword arbitrage
- Not a win-win-win: Violation of trademarks
- DUPLICATE DETECTION
- Duplicate detection
- Near-duplicates: Example
- Detecting near-duplicates
- Recall: Jaccard coefficient
- Jaccard coefficient: Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di), S(dj)) ≅ P(xi = xj)
- Estimating Jaccard
- Shingling: Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1: Ngram Models
- Unigram LM with Add-1 smoothing
- Week 2: Basic (Boolean) IR
- Indexer steps: Dictionary & Postings
- The merge
- Week 3: Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5: Index construction
- BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI: Single-pass in-memory indexing
- Distributed Indexing: MapReduce Data flow
- Dynamic Indexing: 2nd simplest approach
- Logarithmic merge
- Week 6: Index Compression
- Index Compression: Dictionary-as-a-String and Blocking
- Front coding
- Postings Compression: Postings file entry
- Variable Byte Encoding: Example
- Week 7: Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8: Complete Search Systems
- Recap: Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9: IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- A/B testing
- Week 10: Relevance Feedback, Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea: lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11: Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25: A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model: Summary
- Week 12: Web Search
- Crawling
- Robots.txt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
CS3245 ndash Information Retrieval
Parametric Indices
Fields Year = 1601 is an example of
a field Field or parametric index
postings for each field value Sometimes build range (B-
tree) trees (eg for dates)
Field query typically treated as conjunction (doc must be authored by
shakespeare)
Zone A zone is a region of the doc
that can contain an arbitrary amount of text eg Title Abstract References hellip
Build inverted indexes on zones as well to permit querying
76
Sec 61
CS3245 ndash Information Retrieval
Week 9 IR Evaluation How do we know if our results are any good Benchmarks Precision and Recall Composite measures Test collection
AB Testing
77
Ch 8
CS3245 ndash Information Retrieval
A precision-recall curve
00
02
04
06
08
10
00 02 04 06 08 10Recall
Pre
cisi
on
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs. IR in general
- Without search engines, the web wouldn't work
- Interest aggregation
- SEARCH ADVERTISING
- 1st Generation of Search Ads: Goto (1996)
- 1st Generation of search ads: Goto (1996)
- 2nd generation of search ads: Google (2000)
- Do ads influence editorial content?
- How are ads ranked?
- How are ads ranked?
- Google's second price auction
- Google's second price auction
- Keywords with high bids
- Search ads: A win-win-win
- Not a win-win-win: Keyword arbitrage
- Not a win-win-win: Violation of trademarks
- DUPLICATE DETECTION
- Duplicate detection
- Near-duplicates: Example
- Detecting near-duplicates
- Recall: Jaccard coefficient
- Jaccard coefficient: Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di), S(dj)) ≅ P(xi = xj)
- Estimating Jaccard
- Shingling: Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1: N-gram Models
- Unigram LM with Add-1 smoothing
- Week 2: Basic (Boolean) IR
- Indexer steps: Dictionary & Postings
- The merge
- Week 3: Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5: Index construction
- BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI: Single-pass in-memory indexing
- Distributed Indexing: MapReduce Data flow
- Dynamic Indexing: 2nd simplest approach
- Logarithmic merge
- Week 6: Index Compression
- Index Compression: Dictionary-as-a-String and Blocking
- Front coding
- Postings Compression: Postings file entry
- Variable Byte Encoding: Example
- Week 7: Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8: Complete Search Systems
- Recap: Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9: IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- A/B testing
- Week 10: Relevance Feedback, Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea: lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11: Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25: A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model: Summary
- Week 12: Web Search
- Crawling
- Robots.txt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
Week 9: IR Evaluation
How do we know if our results are any good?
- Benchmarks
- Precision and Recall
- Composite measures
- Test collections
- A/B Testing
Ch. 8
A precision-recall curve
[Figure: a precision-recall curve, with precision on the y-axis plotted against recall on the x-axis, both axes running from 0.0 to 1.0.]
Sec. 8.4
Kappa measure for inter-judge (dis)agreement
Sec. 8.5
Kappa measure: an agreement measure among judges, designed for categorical judgments; it corrects for chance agreement.
Kappa (κ) = [P(A) − P(E)] / [1 − P(E)]
- P(A) – proportion of the time the judges agree
- P(E) – what agreement would be by chance
Gives 0 for chance agreement, 1 for total agreement.
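As a concrete companion to the formula, Cohen's kappa for two judges can be computed directly from their judgment lists; this is an illustrative sketch with made-up data, not from the slides:

```python
def kappa(judge_a, judge_b):
    """Cohen's kappa for two judges labelling the same items."""
    n = len(judge_a)
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n  # P(A)
    # P(E): chance agreement from each judge's marginal label distribution
    labels = set(judge_a) | set(judge_b)
    p_chance = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    return (p_agree - p_chance) / (1 - p_chance)

# Two judges labelling 10 documents as relevant (1) / non-relevant (0)
a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(kappa(a, b))  # P(A)=0.8, P(E)=0.5, so kappa = 0.6
```

Here the judges agree 80% of the time, but with balanced marginals 50% agreement is expected by chance, so kappa is only 0.6.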
A/B testing
Purpose: test a single innovation. Prerequisite: you have a large system with many users.
- Have most users use the old system; divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
- Evaluate with an "automatic" overall evaluation criterion (OEC), like clickthrough on the first result.
- Now we can directly see if the innovation does improve user happiness.
- Probably the evaluation methodology that large search engines trust most.
Sec. 8.6.3
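The traffic split itself is commonly done by hashing a stable user id into buckets, so each user consistently sees the same variant; this is a generic sketch, not a method from the slides:

```python
import hashlib

def bucket(user_id: str, num_buckets: int = 100) -> int:
    """Map a user id to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % num_buckets

def in_experiment(user_id: str) -> bool:
    # Bucket 0 (~1% of users) sees the new system; everyone else the old one.
    return bucket(user_id) == 0

users = [f"user{i}" for i in range(10_000)]
share = sum(in_experiment(u) for u in users) / len(users)
print(f"{share:.1%} of users diverted")  # roughly 1%
```

Because the assignment is deterministic, the same user always lands in the same group, which keeps the OEC comparison clean across sessions.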
Week 10: Relevance Feedback, Query Expansion and XML IR
Chapter 10.3: XML IR
- Basic XML concepts
- Challenges in XML IR
- Vector space model for XML IR
- Evaluation of XML IR
Chapter 9: Query Refinement
1. Relevance Feedback (document level)
   - Explicit RF – Rocchio (1971): when does it work?
   - Variants: Implicit and Blind
2. Query Expansion (term level)
   - Manual thesaurus
   - Automatic Thesaurus Generation
Rocchio (1971)
Sec. 9.2.2
In practice:
q⃗_m = α q⃗_0 + β (1/|D_r|) Σ_{d⃗_j ∈ D_r} d⃗_j − γ (1/|D_nr|) Σ_{d⃗_j ∈ D_nr} d⃗_j
- D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors. Different from C_r and C_nr, as we only get judgments on a few documents.
- α, β, γ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
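The update above can be sketched in a few lines of NumPy; the α, β, γ values and the toy vectors are illustrative, not from the slides:

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance-feedback update (weights are illustrative defaults)."""
    qm = alpha * q0
    if len(rel_docs):
        qm += beta * np.mean(rel_docs, axis=0)      # centroid of D_r
    if len(nonrel_docs):
        qm -= gamma * np.mean(nonrel_docs, axis=0)  # centroid of D_nr
    return np.maximum(qm, 0.0)  # negative term weights are usually clipped to 0

q0 = np.array([1.0, 0.0, 0.0])                       # original query vector
rel = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])   # judged relevant
nonrel = np.array([[0.0, 0.0, 1.0]])                 # judged non-relevant
print(rocchio(q0, rel, nonrel))  # [1.0, 0.75, 0.225]
```

Note how the modified query picks up weight on terms from the relevant centroid while the non-relevant centroid pushes the third term back down.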
Query Expansion
Sec. 9.2.2
- In relevance feedback, users give additional input (relevant / non-relevant) on documents, which is used to reweight terms in the documents.
- In query expansion, users give additional input (good / bad search term) on words or phrases.
Co-occurrence Thesaurus
Sec. 9.2.3
The simplest way to compute one is based on term-term similarities in C = AAᵀ, where A is the M × N term-document matrix and w_ij = (normalized) weight for (t_i, d_j).
For each t_i, pick the terms with high values in C.
(In NLTK. Did you forget?)
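A toy computation of C = AAᵀ with NumPy, assuming a made-up 3 × 4 term-document matrix with L2-normalized rows (so C holds cosine similarities between terms):

```python
import numpy as np

# Toy term-document matrix A (M=3 terms x N=4 documents)
A = np.array([
    [1.0, 1.0, 0.0, 0.0],   # "car"
    [1.0, 1.0, 0.0, 1.0],   # "automobile"
    [0.0, 0.0, 1.0, 1.0],   # "flower"
])
A = A / np.linalg.norm(A, axis=1, keepdims=True)  # normalize rows

C = A @ A.T  # C[i, j] = term-term similarity between t_i and t_j
terms = ["car", "automobile", "flower"]
i = terms.index("car")
# For t_i = "car", pick the other term with the highest value in C
j = max((k for k in range(len(terms)) if k != i), key=lambda k: C[i, k])
print(terms[j])  # automobile
```

"car" and "automobile" co-occur in the same documents, so the thesaurus surfaces "automobile" as the best expansion for "car".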
Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.
[Figure: a Book element with Title "Microsoft" and Author "Bill Gates" is decomposed into its lexicalized subtrees: the bare words ("Microsoft", "Bill", "Gates"), single-node contexts (Title → "Microsoft", Author → "Bill", Author → "Gates"), larger subtrees (Book → Title → "Microsoft"), and finally the whole tree "with words".]
Context resemblance
A simple measure of the similarity of a path c_q in a query and a path c_d in a document is the following context resemblance function CR:
CR(c_q, c_d) = (1 + |c_q|) / (1 + |c_d|) if c_q matches c_d, and 0 otherwise
- |c_q| and |c_d| are the number of nodes in the query path and document path, respectively.
- c_q matches c_d iff we can transform c_q into c_d by inserting additional nodes.
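A direct transcription of CR, treating "c_q can be transformed into c_d by inserting nodes" as a subsequence test; the example paths are chosen to line up with the SimNoMerge slide's CR(c1, c3) = 0.6:

```python
def matches(cq, cd):
    """cq matches cd iff cq can be turned into cd by inserting nodes,
    i.e. cq is a subsequence of cd."""
    it = iter(cd)
    return all(node in it for node in cq)

def cr(cq, cd):
    """Context resemblance between a query path and a document path."""
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

print(cr(["author", "Bill"], ["author", "Bill"]))                        # 1.0
print(cr(["author", "Bill"], ["book", "author", "firstname", "Bill"]))  # (1+2)/(1+4) = 0.6
print(cr(["author", "Bill"], ["title", "Bill"]))                        # no match: 0.0
```

The second call reproduces the 0.6 value: the 2-node query path embeds in the 4-node document path, giving 3/5.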
SimNoMerge example
Query: ⟨c1, t⟩, e.g., c1 = author with term "Bill".
Inverted index (dictionary → postings, as ⟨doc, weight⟩ pairs):
- ⟨c1, t⟩ → ⟨d1, 0.5⟩, ⟨d4, 0.1⟩, ⟨d9, 0.2⟩   (e.g., c1 = author "Bill")
- ⟨c2, t⟩ → ⟨d2, 0.25⟩, ⟨d3, 0.1⟩, ⟨d12, 0.9⟩   (e.g., c2 = title "Bill")
- ⟨c3, t⟩ → ⟨d3, 0.7⟩, ⟨d6, 0.8⟩, ⟨d9, 0.5⟩   (e.g., c3 = book/author/firstname "Bill")
(This example is slightly different from the book.)
Context resemblance: CR(c1, c1) = 1.0; CR(c1, c2) = 0.0; CR(c1, c3) = 0.6
OK to ignore the query vs. ⟨c2, t⟩ since CR = 0.0; compare the query vs. ⟨c1, t⟩ and vs. ⟨c3, t⟩.
If the query term weight wq = 1.0, then, multiplying Context Resemblance × Query Term Weight × Document Term Weight:
sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
All weights have been normalized.
INEX relevance assessments
The relevance-coverage combinations are quantized as follows: [quantization table not reproduced here]
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as
#(relevant items retrieved) = Σ_{c ∈ A} Q(rel(c), cov(c))
Week 11: Probabilistic IR
Chapter 11
1. Probabilistic Approach to Retrieval / Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model; BestMatch25 (Okapi)
Chapter 12
1. Language Models for IR
Ch. 11-12
Binary Independence Model (BIM)
Sec. 11.3
Traditionally used with the PRP.
Assumptions:
- "Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by the vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
- "Independence": no association between terms (not true, but works in practice – the naïve assumption).
Deriving a Ranking Function for Query Terms
Sec. 11.3.1
So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, ct:
ct = log [ (pt / (1 − pt)) / (ut / (1 − ut)) ]
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, pt / (1 − pt), and (ii) the odds of the term appearing if non-relevant, ut / (1 − ut).
ct = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and ct is positive if it is more likely to appear in relevant documents.
ct functions as a term weight, so that RSVd = Σ_{t : xt = qt = 1} ct.
Operationally, we sum ct quantities in accumulators for query terms appearing in documents, just like in VSM.
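The weight ct and the RSV sum transcribe directly into code; the probabilities pt and ut below are made-up examples, not estimates from real data:

```python
import math

def c_t(p_t, u_t):
    """BIM log odds-ratio term weight."""
    return math.log((p_t / (1 - p_t)) / (u_t / (1 - u_t)))

def rsv(query_terms, doc_terms, weights):
    """Sum c_t over query terms that occur in the document (x_t = q_t = 1)."""
    return sum(weights[t] for t in query_terms if t in doc_terms)

# "jaguar" is much likelier in relevant docs; "car" has equal odds
weights = {"jaguar": c_t(0.8, 0.1), "car": c_t(0.5, 0.5)}
print(weights["car"])  # equal odds -> weight 0.0
print(rsv({"jaguar", "car"}, {"jaguar", "speed"}, weights))
```

As on the slide, a term with equal odds contributes nothing, so the document's score comes entirely from "jaguar".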
Okapi BM25: A Nonbinary Model
Sec. 11.4.3
If the query is long, we might also use similar weighting for query terms:
RSVd = Σ_{t ∈ q} log(N / dft) × [(k1 + 1) tftd] / [k1((1 − b) + b (Ld / Lave)) + tftd] × [(k3 + 1) tftq] / [k3 + tftq]
- tftq: term frequency in the query q
- k3: tuning parameter controlling term frequency scaling of the query
- No length normalization of queries (because retrieval is being done with respect to a single fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 as values between 1.2 and 2, and b = 0.75.
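A sketch of the formula as code, using toy counts; the idf variant log(N/dft) follows the formula above and ignores smoothing, and all the frequencies are illustrative:

```python
import math

def bm25_score(query, doc, df, N, avg_len, k1=1.5, b=0.75, k3=1.5):
    """BM25 score of one document for one (possibly long) query."""
    L_d = sum(doc.values())  # document length in tokens
    score = 0.0
    for t, tf_q in query.items():
        tf_d = doc.get(t, 0)
        if tf_d == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        doc_part = ((k1 + 1) * tf_d) / (k1 * ((1 - b) + b * L_d / avg_len) + tf_d)
        query_part = ((k3 + 1) * tf_q) / (k3 + tf_q)  # query tf scaling, no length norm
        score += idf * doc_part * query_part
    return score

doc = {"jaguar": 3, "car": 1, "speed": 2}        # term -> tf in document
query = {"jaguar": 1, "car": 1}                  # term -> tf in query
df = {"jaguar": 10, "car": 500, "speed": 100}    # document frequencies
print(bm25_score(query, doc, df, N=1000, avg_len=6))
```

The rare term "jaguar" dominates the score through its idf, while the query part only matters once query terms start repeating.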
An Appraisal of Probabilistic Models
Sec. 11.4.1
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR is that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Using language models in IR
Sec. 12.2
- Each document is treated as (the basis for) a language model.
- Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q)
- P(q) is the same for all documents, so ignore it.
- P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
- P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
Mixture model: Summary
Sec. 12.2.2
P(q|d) ∝ Π_{t ∈ q} (λ P(t|Md) + (1 − λ) P(t|Mc))
What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.
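A sketch of ranking by this mixture, with λ = 0.5 and maximum-likelihood estimates for P(t|Md) and P(t|Mc); the toy collection and documents are made up:

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """P(q|d) under a mixture of document and collection language models."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    p = 1.0
    for t in query:
        p_md = doc_tf[t] / doc_len    # P(t | Md), MLE from the document
        p_mc = coll_tf[t] / coll_len  # P(t | Mc), MLE from the collection
        p *= lam * p_md + (1 - lam) * p_mc  # smoothing avoids zeroing out P(q|d)
    return p

collection = "the jaguar is a big cat the car is fast".split()
d1 = "the jaguar is a big cat".split()
d2 = "the car is fast".split()
q = ["jaguar", "cat"]
# d1 mentions both query terms, so it should score higher than d2
print(query_likelihood(q, d1, collection) > query_likelihood(q, d2, collection))
```

The collection model keeps d2's score nonzero even though it contains neither query term, which is exactly what the mixture buys over a pure document model.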
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
Ch. 20-21
Crawling
Sec. 20.1
- Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
- Duplicates: need to integrate duplicate detection
- Scale: need to use a distributed approach
- Spam and spider traps: need to integrate spam detection
Robots.txt
Sec. 20.2
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.
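Python's standard library can parse exactly this kind of file; here the example above is fed to urllib.robotparser as literal lines rather than fetched over the network (per the slide, the parsed result is what you would cache per site):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /yoursite/temp/",
    "",
    "User-agent: searchengine",
    "Disallow:",
])

# A generic crawler is blocked from /yoursite/temp/ by the wildcard rule
print(rp.can_fetch("mycrawler", "http://example.com/yoursite/temp/page.html"))    # False
# "searchengine" has its own block with an empty Disallow, so it may fetch anything
print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html")) # True
```

In a real crawler you would call `rp.set_url(".../robots.txt")` and `rp.read()` once per site, then keep the parser object around as the cached copy.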
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
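The four steps above as a power-iteration sketch in NumPy; the 2 × 2 transition matrix is a toy example, with teleportation assumed already folded into A:

```python
import numpy as np

def pagerank(A, tol=1e-10, max_iter=1000):
    """Power iteration: multiply x by A until the product stops changing.
    A must be a row-stochastic transition matrix (teleport already mixed in)."""
    n = A.shape[0]
    x = np.zeros(n)
    x[0] = 1.0  # start with any distribution, say (1, 0, ..., 0)
    for _ in range(max_iter):
        x_next = x @ A
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x_next

# Tiny two-page chain
A = np.array([[0.25, 0.75],
              [0.50, 0.50]])
a = pagerank(A)
print(a)  # steady state (0.4, 0.6)
```

Solving aA = a by hand for this chain gives a = (2/5, 3/5), which the iteration reproduces regardless of the starting distribution.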
Pagerank summary
Sec. 21.2.2
- Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
- Query processing: retrieve pages meeting the query, then rank them by their pagerank. The order is query-independent.
WHERE TO GO FROM HERE
Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
- Python – one of the easiest and most straightforward programming languages to use
- NLTK – a good set of routines and data that are useful in dealing with NLP and IR
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IR / VSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
A precision-recall curve
00
02
04
06
08
10
00 02 04 06 08 10Recall
Pre
cisi
on
78
Sec 84
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Kappa measure for inter-judge (dis)agreement
Information Retrieval 79
Sec 85
Kappa measure Agreement measure among judges Designed for categorical judgments Corrects for chance agreement
Kappa (K) = [P(A)-P(E)][1-P(E)] P(A) ndash proportion of time judges agree P(E) ndash what agreement would be by chance
Gives 0 for chance agreement 1 for total agreement
CS3245 ndash Information Retrieval
AB testingPurpose Test a single innovationPrerequisite You have a large system with many users
Have most users use old system Divert a small proportion of traffic (eg 1) to the new
system that includes the innovation Evaluate with an ldquoautomaticrdquo overall evaluation criterion
(OEC) like clickthrough on first result Now we can directly see if the innovation does improve user
happiness Probably the evaluation methodology that large search
engines trust most80
Sec 863
CS3245 ndash Information Retrieval
Week 10 Relevance Feedback Query Expansion and XML IR
Chapter 103 XML IR Basic XML concepts Challenges in XML IR Vector space model
for XML IR Evaluation of XML IR
Chapter 9 Query Refinement1 Relevance Feedback
Document Level
Explicit RF ndash Rocchio (1971) When does it work
Variants Implicit and Blind
2 Query ExpansionTerm Level
Manual thesaurus
Automatic Thesaurus Generation
81
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV: we can rank documents using the log odds ratios for the terms in the query, $c_t$:

$c_t = \log \frac{p_t}{1-p_t} - \log \frac{u_t}{1-u_t} = \log \frac{p_t(1-u_t)}{u_t(1-p_t)}$

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, $p_t/(1-p_t)$, and (ii) the odds of the term appearing if non-relevant, $u_t/(1-u_t)$.
$c_t = 0$ if a term has equal odds of appearing in relevant and non-relevant documents; $c_t$ is positive if it is more likely to appear in relevant documents.
$c_t$ functions as a term weight, so that $\mathrm{RSV}_d = \sum_{t:\, x_t = q_t = 1} c_t$.
Operationally, we sum the $c_t$ quantities in accumulators for query terms appearing in documents, just like in VSM.
Sec. 11.3.1
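As a sketch of the accumulator-style computation above (not from the slides): the add-0.5 smoothing of the estimates for p_t and u_t, and the toy collection, are illustrative assumptions.

```python
import math
from collections import defaultdict

def bim_weights(query_terms, docs, relevant_ids):
    """Compute BIM term weights c_t, with add-0.5 smoothing of the estimates."""
    N = len(docs)
    weights = {}
    for t in query_terms:
        df = sum(1 for d in docs.values() if t in d)       # docs containing t
        vr = sum(1 for i in relevant_ids if t in docs[i])  # relevant docs containing t
        p = (vr + 0.5) / (len(relevant_ids) + 1)           # estimate of p_t
        u = (df - vr + 0.5) / (N - len(relevant_ids) + 1)  # estimate of u_t
        weights[t] = math.log(p * (1 - u) / (u * (1 - p)))
    return weights

def rsv(query_terms, docs, weights):
    """Sum c_t in accumulators for query terms appearing in each document."""
    scores = defaultdict(float)
    for doc_id, terms in docs.items():
        for t in query_terms:
            if t in terms:
                scores[doc_id] += weights[t]
    return dict(scores)

docs = {1: {"web", "search"}, 2: {"web", "spam"}, 3: {"search", "ranking"}}
w = bim_weights(["search"], docs, relevant_ids=[1, 3])
scores = rsv(["search"], docs, w)
```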
Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
tf_tq: term frequency in the query q.
k3: tuning parameter controlling term frequency scaling of the query.
No length normalization of queries (because retrieval is being done with respect to a single, fixed query).
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.
Sec. 11.4.3
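A minimal BM25 scorer for a single document, sketching the weighting the slide describes, including the long-query factor with k3; the function and variable names are assumptions, and the formula follows the standard Okapi form.

```python
import math

def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, N, df,
               k1=1.2, b=0.75, k3=1.5):
    """Score one document against a query.
    query_tf: {term: tf in query}; doc_tf: {term: tf in doc};
    df: {term: document frequency in the collection of N docs}."""
    score = 0.0
    for t, tfq in query_tf.items():
        tfd = doc_tf.get(t, 0)
        if tfd == 0 or t not in df:
            continue
        idf = math.log(N / df[t])
        # document term-frequency component, with length normalization
        doc_part = (k1 + 1) * tfd / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tfd)
        # query term-frequency component (no length normalization of queries)
        query_part = (k3 + 1) * tfq / (k3 + tfq)
        score += idf * doc_part * query_part
    return score

s = bm25_score({"web": 1}, {"web": 3, "search": 1}, doc_len=4,
               avg_doc_len=5, N=100, df={"web": 10})
```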
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Sec. 11.4.1
Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q):
P(q) is the same for all documents, so we can ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking by P(q|d) and by P(d|q) is equivalent.
Sec. 12.2
Mixture model: Summary
P(q|d) ∝ ∏_{t∈q} ( λ P(t|M_d) + (1−λ) P(t|M_c) ), with mixture weight λ.
What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one.
Sec. 12.2.2
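A sketch of query-likelihood scoring with this mixture (Jelinek-Mercer) smoothing, in log space; the toy collection and the λ value are illustrative assumptions.

```python
import math
from collections import Counter

def mixture_lm_score(query, doc, collection, lam=0.5):
    """log P(q|d) under a mixture of the document model M_d
    and the collection model M_c, both maximum-likelihood estimates."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    score = 0.0
    for t in query:
        p_doc = doc_counts[t] / doc_len      # P(t | M_d)
        p_coll = coll_counts[t] / coll_len   # P(t | M_c), smooths zero counts
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

d1 = ["click", "go", "the", "boat"]
d2 = ["click", "the", "shears", "boys"]
collection = d1 + d2
s1 = mixture_lm_score(["click", "boat"], d1, collection)
s2 = mixture_lm_score(["click", "boat"], d2, collection)
```

Note that d2 still gets a nonzero score for "boat" thanks to the collection model, but ranks below d1, which actually contains the term.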
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
Ch. 20-21
Crawling
Politeness: need to crawl only the allowed pages, and space out the requests to a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.
Sec. 20.1
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.
Sec. 20.2
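In Python, the standard library's urllib.robotparser implements this protocol; a sketch, parsing the rules from a string rather than fetching them, so the example is self-contained (the crawler names and URLs are made up):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A generic crawler is barred from /yoursite/temp/ but not from the root
ok_root = rp.can_fetch("somebot", "http://example.com/")
ok_temp = rp.can_fetch("somebot", "http://example.com/yoursite/temp/page.html")
# The crawler named "searchengine" is allowed everywhere (empty Disallow)
ok_se = rp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html")
```

In a real crawler, the parsed rules would be cached per site, as the slide advises, instead of re-fetching robots.txt for every page.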
Pagerank algorithm
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, we're at xA^2; then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
Sec. 21.2.2
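The power-iteration loop above can be sketched as follows; the 2-state transition matrix is a hand-built example (with any teleportation assumed already folded into A):

```python
def power_iteration(A, x, tol=1e-10, max_iter=1000):
    """Repeatedly multiply the distribution x by the row-stochastic
    transition matrix A (nested lists) until the product stabilizes."""
    n = len(A)
    for _ in range(max_iter):
        # one step: x <- xA (x is a row vector)
        x_next = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, x_next)) < tol:
            return x_next
        x = x_next
    return x

# Toy 2-state chain (each row sums to 1)
A = [[0.1, 0.9],
     [0.5, 0.5]]
a = power_iteration(A, [1.0, 0.0])  # steady state, independent of the start
```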
Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.
Sec. 21.2.2
WHERE TO GO FROM HERE
Learning Objectives
You have learnt how to build your own search engine. In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use.
NLTK – a good set of routines and data that are useful in dealing with NLP and IR.
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IR/VSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
Opportunities in IR
Thanks for joining us for the journey
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCH ADVERTISING
- 1st Generation of Search Ads: Goto (1996)
- 1st Generation of search ads: Goto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATE DETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di), S(dj)) ≅ P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index Compression: Dictionary-as-a-String and Blocking
- Front coding
- Postings Compression: Postings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- A/B testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robots.txt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
CS3245 ndash Information Retrieval Sec 922
82
Rocchio (1971)
In practice
Dr = set of known relevant doc vectorsDnr = set of known irrelevant doc vectors Different from Cr and Cnr as we only get judgments
from a few documentsαβγ = weights (hand-chosen or set empirically)
Popularized in the SMART system (Salton)
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
Main idea: lexicalized subtrees
Aim to have each dimension of the vector space encode a word together with its position within the XML tree.
How? Map XML documents to lexicalized subtrees.
85
[Figure: an XML tree (Book, with child Title "Microsoft" and child Author "Bill Gates") decomposed into its lexicalized subtrees – subtrees "with words", e.g. Author–Bill, Author–Gates, Book–Title–Microsoft]
Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:
CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise
|cq| and |cd| are the number of nodes in the query path and document path, respectively.
cq matches cd iff we can transform cq into cd by inserting additional nodes.
86
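The CR function can be sketched directly, treating each path as a list of node labels; the subsequence test below implements "transform cq into cd by inserting additional nodes". The contexts mirror the slide's example:

```python
def matches(cq, cd):
    """True iff cq can be transformed into cd by inserting additional nodes,
    i.e. cq is a subsequence of cd."""
    it = iter(cd)
    return all(node in it for node in cq)  # `in` consumes the iterator, preserving order

def cr(cq, cd):
    """Context resemblance: (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

# The example contexts from the next slide:
c1 = ["author", "Bill"]
c2 = ["title", "Bill"]
c3 = ["book", "author", "firstname", "Bill"]
print(cr(c1, c1), cr(c1, c2), cr(c1, c3))   # 1.0 0.0 0.6
```

Note CR(c1, c3) = (1 + 2) / (1 + 4) = 0.6, matching the worked example.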
SimNoMerge example
87
[Figure: the query context <c1, t> is scored against an inverted index whose dictionary holds the contexts <c1, t>, <c2, t>, <c3, t>, each with a postings list of <document, weight> pairs; d9 appears under <c1, t> with weight 0.2 and under <c3, t> with weight 0.5]
This example is slightly different from the book's.
e.g. c1 = author/"Bill", c2 = title/"Bill", c3 = book/author/firstname/"Bill"
CR(c1, c1) = 1.0
CR(c1, c2) = 0.0
CR(c1, c3) = 0.6
If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5
(each product is Context Resemblance × Query Term Weight × Document Term Weight; all weights have been normalized)
OK to ignore Query vs. <c2, t>, since CR = 0.0; only Query vs. <c1, t> and Query vs. <c3, t> contribute.
INEX relevance assessments
The relevance-coverage combinations are quantized as follows: [quantization table omitted]
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval. The quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as Σ_{c ∈ A} Q(rel(c), cov(c)).
88
Week 11: Probabilistic IR
Chapter 11: 1. Probabilistic Approach to Retrieval and Basic Probability Theory; 2. Probability Ranking Principle; 3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12: 1. Language Models for IR
Ch 11-12
89
90
Binary Independence Model (BIM)
Traditionally used with the PRP.
Assumptions:
Binary (equivalent to Boolean): documents and queries represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
Independence: no association between terms (not true, but works in practice – a naïve assumption).
Sec 11.3
Deriving a Ranking Function for Query Terms
So it boils down to computing the RSV. We can rank documents using the log odds ratios c_t for the terms in the query:
c_t = log [ p_t / (1 − p_t) ] − log [ u_t / (1 − u_t) ] = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]
The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).
c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.
c_t functions as a term weight, so that RSV_d = Σ_{t ∈ q, x_t = 1} c_t.
Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.
91
Sec 11.3.1
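A small sketch of the accumulator-style scoring (function names are illustrative; p and u are assumed to be given per-term estimates):

```python
from math import log

def c_t(p_t, u_t):
    """Log odds ratio c_t = log[p_t(1 - u_t) / (u_t(1 - p_t))]."""
    return log(p_t * (1 - u_t) / (u_t * (1 - p_t)))

def rsv(query, doc_terms, p, u):
    """Sum c_t in an accumulator for query terms appearing in the document,
    just like VSM scoring."""
    return sum(c_t(p[t], u[t]) for t in query if t in doc_terms)

# Equal odds in relevant and non-relevant docs -> weight 0:
print(c_t(0.5, 0.5))   # 0.0
```

A term with p_t > u_t contributes positively, pulling documents containing it up the ranking.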
Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
RSV_d = Σ_{t ∈ q} log(N/df_t) · [ (k1 + 1)·tf_td / ( k1·((1 − b) + b·(L_d/L_ave)) + tf_td ) ] · [ (k3 + 1)·tf_tq / ( k3 + tf_tq ) ]
tf_tq: term frequency in the query q
k3: tuning parameter controlling term frequency scaling of the query
No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, with b = 0.75.
Sec 11.4.3
92
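A direct transcription of the formula (the function signature is illustrative; tuning values default to the ranges the slide suggests):

```python
from math import log

def bm25(query_tf, doc_tf, L_d, L_ave, N, df, k1=1.2, b=0.75, k3=1.2):
    """Okapi BM25 with the extra (k3+1)tf_tq / (k3+tf_tq) factor for long
    queries; note there is no length normalization of the query itself."""
    score = 0.0
    for t, tf_tq in query_tf.items():
        if t not in doc_tf or t not in df:
            continue
        tf_td = doc_tf[t]
        idf = log(N / df[t])
        doc_part = (k1 + 1) * tf_td / (k1 * ((1 - b) + b * L_d / L_ave) + tf_td)
        query_part = (k3 + 1) * tf_tq / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score
```

For a single-occurrence term in an average-length document, doc_part and query_part both reduce to 1, leaving just the idf term.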
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
Difference: for probabilistic IR, at the end you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Sec 11.4.1
93
Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q):
P(d|q) = P(q|d) P(d) / P(q)
P(q) is the same for all documents, so ignore it.
P(d) is the prior – often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g. those with high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d. So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
94
Sec 12.2
Mixture model: Summary
P(q|d) ∝ Π_{t ∈ q} ( λ·P(t|M_d) + (1 − λ)·P(t|M_c) )
What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document the user had in mind was in fact this one.
95
Sec 12.2.2
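The mixture scoring can be sketched with toy token lists (the documents and the query here are made up; λ = 0.5 is just a convenient choice):

```python
from math import log

def log_p_q_given_d(query, doc, collection, lam=0.5):
    """log P(q|d) under the Jelinek-Mercer mixture
    lam * P(t|Md) + (1 - lam) * P(t|Mc)."""
    return sum(log(lam * doc.count(t) / len(doc) +
                   (1 - lam) * collection.count(t) / len(collection))
               for t in query)

d1 = ["click", "go", "the", "shears", "boys"]
d2 = ["click", "click", "metal", "here"]
collection = d1 + d2          # collection model Mc pooled from all docs
q = ["click", "shears"]
best = max([d1, d2], key=lambda d: log_p_q_given_d(q, d, collection))
```

The collection term keeps the product nonzero even when a query term (here "shears" for d2) is missing from a document.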
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
96
Ch 20-21
Crawling
Politeness: need to crawl only the allowed pages, and space out the requests for a site over a longer period (hours, days)
Duplicates: need to integrate duplicate detection
Scale: need to use a distributed approach
Spam and spider traps: need to integrate spam detection
Information Retrieval 97
Sec 20.1
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website; originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.
Information Retrieval 98
Sec 20.2
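Python's standard library can parse rules like the ones above; a small sketch (the site name in the comment is a placeholder, and the rules are fed in directly instead of being fetched):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# In practice: rp.set_url("https://yoursite.example/robots.txt"); rp.read()
# Here we parse the example rules directly:
rp.parse("""\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

print(rp.can_fetch("*", "/yoursite/temp/page.html"))             # ordinary crawlers: blocked
print(rp.can_fetch("searchengine", "/yoursite/temp/page.html"))  # the exempted agent: allowed
```

An empty Disallow line grants that agent access to everything, which is why "searchengine" is exempt.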
99
Pagerank algorithm
Sec 21.2.2
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0))
2. After one step, we're at xA
3. After two steps, at xA², then xA³, and so on
4. "Eventually" means: for large k, xA^k = a
Algorithm: multiply x by increasing powers of A until the product looks stable.
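The power iteration above, sketched with a made-up 3-page row-stochastic transition matrix (teleporting assumed to be already folded into A):

```python
import numpy as np

# Toy row-stochastic transition matrix A for 3 pages.
A = np.array([[0.10, 0.80, 0.10],
              [0.45, 0.10, 0.45],
              [0.10, 0.80, 0.10]])

x = np.array([1.0, 0.0, 0.0])       # start with any distribution
while True:
    x_next = x @ A                  # one more power of A
    if np.allclose(x_next, x, rtol=0.0, atol=1e-12):
        break                       # the product looks stable: steady state a
    x = x_next

print(x)  # the PageRank vector a, independent of the starting distribution
```

The result is a probability distribution (its entries sum to 1) satisfying a = aA, as the summary slide describes.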
100
Pagerank summary
Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query; rank them by their pagerank. The order is query-independent.
Sec 21.2.2
101
WHERE TO GO FROM HERE
Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR
102
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
104
Opportunities in IR
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Query Expansion
In relevance feedback users give additional input (relevantnon-relevant) on documents which is used to reweight terms in the documents
In query expansion users give additional input (goodbad search term) on words or phrases
Sec 922
83
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Co-occurrence ThesaurusSimplest way to compute one is based on term-term similarities in C = AAT where A is term-document matrixwij = (normalized) weight for (ti dj)
For each ti pick terms with high values in C
t i
dj N
M
Sec 923
In NLTK Did you forget
84
CS3245 ndash Information Retrieval
Main idea lexicalized subtrees Aim to have each dimension of the vector space
encode a word together with its position within the XML tree
How Map XML documents to lexicalized subtrees
85
Book
Title Author
Bill GatesMicrosoft
Author
Bill Gates
Microsoft Bill Gates
Title
Microsoft
Author
Gates
Author
Bill
Book
Title
Microsoft
Book
ldquoWith wordsrdquo
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 – Information Retrieval

Main idea: lexicalized subtrees. Aim to have each dimension of the vector space encode a word together with its position within the XML tree. How? Map XML documents to lexicalized subtrees.

[Figure, slide 85: an example document Book → (Title: "Microsoft", Author: "Bill Gates") is decomposed into its lexicalized subtrees "with words", e.g. the bare words Microsoft, Bill, Gates; Title → Microsoft; Author → Bill; Author → Gates; Book → Title → Microsoft; and so on.]
Context resemblance. A simple measure of the similarity of a path cq in a query and a path cd in a document is the following context resemblance function CR:

CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise,

where |cq| and |cd| are the number of nodes in the query path and document path, respectively. cq matches cd iff we can transform cq into cd by inserting additional nodes.
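As a small sketch (not from the slides), the CR function can be written directly in Python. The matching condition, "cq can be turned into cd by inserting nodes", is exactly the condition that cq's nodes appear in order as a subsequence of cd's nodes:

```python
def cr(cq, cd):
    """Context resemblance between a query path cq and a document path cd,
    each given as a list of node labels, e.g. ["book", "author"].
    cq matches cd iff cq is an (ordered) subsequence of cd."""
    it = iter(cd)
    matches = all(node in it for node in cq)  # consuming iterator = subsequence check
    return (1 + len(cq)) / (1 + len(cd)) if matches else 0.0

print(cr(["author"], ["author"]))                       # identical paths: 1.0
print(cr(["author"], ["book", "author", "firstname"]))  # matches by insertion: 0.5
print(cr(["author"], ["title"]))                        # no match: 0.0
```

Note that this plain formula gives 0.5 for author vs. book/author/firstname; the slide's example uses 0.60 and explicitly says it differs slightly from the book.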
SimNoMerge example (this example is slightly different from the book's):

Query: ⟨c1, t⟩, e.g. author/"Bill"

Inverted index (dictionary entry → postings of ⟨document, weight⟩ pairs; all weights have been normalized):
⟨c1, t⟩ (e.g. author/"Bill"): ⟨d1, 0.5⟩ ⟨d4, 0.1⟩ ⟨d9, 0.2⟩
⟨c2, t⟩ (e.g. title/"Bill"): ⟨d2, 0.25⟩ ⟨d3, 0.1⟩ ⟨d12, 0.9⟩
⟨c3, t⟩ (e.g. book/author/firstname/"Bill"): ⟨d3, 0.7⟩ ⟨d6, 0.8⟩ ⟨d9, 0.5⟩

Query vs. ⟨c1, t⟩: CR(c1, c1) = 1.0. Query vs. ⟨c2, t⟩: CR(c1, c2) = 0.0, so it is OK to ignore that row. Query vs. ⟨c3, t⟩: CR(c1, c3) = 0.60.

Each matching posting contributes (context resemblance) × (query term weight) × (document term weight). If wq = 1.0, then sim(q, d9) = (1.0 × 1.0 × 0.2) + (0.6 × 1.0 × 0.5) = 0.5.
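The scoring loop behind this example can be sketched in a few lines; the postings and CR values below are the slide's own numbers, while the function and variable names are illustrative:

```python
def sim_no_merge(query, index, cr, wq=1.0):
    """SimNoMerge: score documents for a query context-term pair (cq, t).
    index maps each (context, term) pair to postings of (doc, weight);
    cr(cq, c) gives the context resemblance between two contexts."""
    scores = {}
    cq, t = query
    for (c, term), postings in index.items():
        if term != t:
            continue
        r = cr(cq, c)
        if r == 0.0:
            continue  # contexts that don't resemble the query contribute nothing
        for doc, weight in postings:
            scores[doc] = scores.get(doc, 0.0) + r * wq * weight
    return scores

# Postings from the slide (weights already normalized)
index = {
    ("c1", "Bill"): [("d1", 0.5), ("d4", 0.1), ("d9", 0.2)],
    ("c2", "Bill"): [("d2", 0.25), ("d3", 0.1), ("d12", 0.9)],
    ("c3", "Bill"): [("d3", 0.7), ("d6", 0.8), ("d9", 0.5)],
}
CR = {("c1", "c1"): 1.0, ("c1", "c2"): 0.0, ("c1", "c3"): 0.6}

scores = sim_no_merge(("c1", "Bill"), index, lambda a, b: CR[(a, b)])
print(round(scores["d9"], 2))  # 1.0*1.0*0.2 + 0.6*1.0*0.5 = 0.5
```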
INEX relevance assessments. The relevance-coverage combinations are quantized by a function Q. This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval; the quantization function Q instead allows us to grade each component as partially relevant. The number of relevant components in a retrieved set A of components can then be computed as the sum of Q(rel(c), cov(c)) over the components c in A.
Week 11: Probabilistic IR
Chapter 11:
1. Probabilistic Approach to Retrieval & Basic Probability Theory
2. Probability Ranking Principle
3. Binary Independence Model, BestMatch25 (Okapi)
Chapter 12:
1. Language Models for IR
(Ch 11-12)
Binary Independence Model (BIM). Traditionally used with the PRP.

Assumptions:
- "Binary" (equivalent to Boolean): documents and queries are represented as binary term incidence vectors. E.g., document d is represented by vector x = (x1, …, xM), where xt = 1 if term t occurs in d and xt = 0 otherwise. Different documents may have the same vector representation.
- "Independence": no association between terms (not true, but works in practice: a naïve assumption).

Sec 11.3
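The binary representation can be made concrete; this small sketch (vocabulary and documents are made up for illustration) builds incidence vectors and shows that two different documents can collapse to the same vector:

```python
def incidence_vector(doc, vocab):
    """x_t = 1 if term t occurs in doc, else 0 (order and counts are lost)."""
    terms = set(doc.split())
    return tuple(1 if t in terms else 0 for t in vocab)

vocab = ["cat", "dog", "fish"]
x1 = incidence_vector("cat dog", vocab)
x2 = incidence_vector("dog cat cat dog", vocab)  # repeats and order don't matter

print(x1)        # (1, 1, 0)
print(x1 == x2)  # True: different documents, same vector
```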
Deriving a Ranking Function for Query Terms. So it boils down to computing RSV. We can rank documents using the log odds ratios for the terms in the query, c_t:

c_t = log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ]

The odds ratio is the ratio of two odds: (i) the odds of the term appearing if relevant, p_t / (1 − p_t), and (ii) the odds of the term appearing if non-relevant, u_t / (1 − u_t).

c_t = 0 if a term has equal odds of appearing in relevant and non-relevant documents, and c_t is positive if it is more likely to appear in relevant documents.

c_t functions as a term weight, so that RSV_d = Σ c_t, summed over the query terms present in d. Operationally, we sum c_t quantities in accumulators for query terms appearing in documents, just like in VSM.

Sec 11.3.1
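A sketch of computing c_t from counts; the add-1/2 smoothed estimators for p_t and u_t are our choice here (a standard one), not something the slide specifies:

```python
import math

def c_t(df_t, N, r_t=0, R=0):
    """Log odds-ratio term weight for the BIM.
    p_t = P(term present | relevant), u_t = P(term present | non-relevant),
    estimated with add-1/2 smoothing from: R relevant docs, r_t of which
    contain t; N docs in total, df_t of which contain t."""
    p = (r_t + 0.5) / (R + 1.0)
    u = (df_t - r_t + 0.5) / (N - R + 1.0)
    return math.log(p * (1 - u) / (u * (1 - p)))

# With no relevance information (R = r_t = 0) this behaves like an idf:
# rarer terms get a larger weight, and a term in half the docs gets ~0.
print(c_t(df_t=10, N=1000) > c_t(df_t=500, N=1000))  # True
print(round(c_t(df_t=500, N=1000), 3))
```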
Okapi BM25: A Nonbinary Model. If the query is long, we might also use similar weighting for query terms:

RSV_d = Σ_{t∈q} log(N / df_t) × [ (k1 + 1) tf_td / ( k1 ((1 − b) + b (L_d / L_ave)) + tf_td ) ] × [ (k3 + 1) tf_tq / ( k3 + tf_tq ) ]

tf_tq: term frequency in the query q. k3: tuning parameter controlling term frequency scaling of the query. There is no length normalization of queries (because retrieval is being done with respect to a single, fixed query). The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.

Sec 11.4.3
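The formula above translates almost line-for-line into code. This is a sketch over toy in-memory documents (the collection and query are illustrative), with the parameter values the slide reports as reasonable:

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75, k3=1.5):
    """Okapi BM25 with query-term weighting (useful when queries are long).
    query, doc: lists of terms; docs: the whole collection, as lists of terms."""
    N = len(docs)
    L_ave = sum(len(d) for d in docs) / N
    score = 0.0
    for t in set(query):
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue
        idf = math.log(N / df)
        tf_td = doc.count(t)
        tf_tq = query.count(t)
        doc_part = (k1 + 1) * tf_td / (k1 * ((1 - b) + b * len(doc) / L_ave) + tf_td)
        qry_part = (k3 + 1) * tf_tq / (k3 + tf_tq)  # no length normalization of q
        score += idf * doc_part * qry_part
    return score

docs = [["gold", "silver", "truck"], ["shipment", "of", "gold"], ["truck", "arrived"]]
q = ["gold", "truck"]
ranked = sorted(range(len(docs)), key=lambda i: bm25_score(q, docs[i], docs), reverse=True)
print(ranked[0])  # doc 0, the only one containing both query terms
```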
An Appraisal of Probabilistic Models. The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR is that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.

Sec 11.4.1
Using language models in IR. Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q):
- P(q) is the same for all documents, so we can ignore it.
- P(d) is the prior, often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
- P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

Sec 12.2
Mixture model: Summary. What we model: the user has a document in mind and generates the query from this document. The equation

P(d, q) ∝ P(d) × Π_{t∈q} ( λ P(t|M_d) + (1 − λ) P(t|M_c) )

represents the probability that the document that the user had in mind was in fact this one.

Sec 12.2.2
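A minimal sketch of query-likelihood scoring with this mixture (Jelinek-Mercer smoothing), assuming a uniform prior P(d) and maximum-likelihood estimates for the document and collection models; the toy collection is illustrative:

```python
import math

def mixture_log_prob(query, doc, collection, lam=0.5):
    """log P(q|d) under the mixture model: each query term t contributes
    lam * P(t|Md) + (1 - lam) * P(t|Mc), where Md is the document language
    model (MLE) and Mc the collection language model."""
    total = sum(len(d) for d in collection)
    logp = 0.0
    for t in query:
        p_md = doc.count(t) / len(doc)
        p_mc = sum(d.count(t) for d in collection) / total
        logp += math.log(lam * p_md + (1 - lam) * p_mc)
    return logp

collection = [["frog", "said", "that", "toad", "likes", "frog"],
              ["toad", "likes", "frog"]]
q = ["frog", "likes"]
lm_scores = [mixture_log_prob(q, d, collection) for d in collection]
print(lm_scores[1] > lm_scores[0])  # True: d2 gives q higher likelihood
```

Note how the collection model keeps a document from scoring zero just because it misses one query term: the (1 − λ) P(t|M_c) term smooths the estimate.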
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
(Ch 20-21)
Crawling:
- Politeness: need to crawl only the allowed pages, and space out the requests to a site over a longer period (hours, days)
- Duplicates: need to integrate duplicate detection
- Scale: need to use a distributed approach
- Spam and spider traps: need to integrate spam detection

Sec 20.1
Robots.txt: a protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

    User-agent: *
    Disallow: /yoursite/temp

    User-agent: searchengine
    Disallow:

This says: no robot except the one called "searchengine" should visit any URL whose path starts with /yoursite/temp. Important: cache the robots.txt file of each site we are crawling.

Sec 20.2
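Python's standard library can interpret such files directly; this sketch uses rules in the spirit of the slide's example (the site, path, and agent names are illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /yoursite/temp

User-agent: searchengine
Disallow:
"""

rfp = RobotFileParser()
rfp.parse(rules.splitlines())

# Any robot except "searchengine" must stay out of /yoursite/temp
print(rfp.can_fetch("mybot", "http://example.com/yoursite/temp/page.html"))
print(rfp.can_fetch("searchengine", "http://example.com/yoursite/temp/page.html"))
```

A crawler would fetch each site's robots.txt once, parse it like this, cache the parser object, and consult can_fetch before every request to that site.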
Pagerank algorithm. Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, …, 0)).
2. After one step, we're at xA.
3. After two steps, at xA^2; then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.

Sec 21.2.2
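The steps above can be sketched without any linear-algebra library. The 3-page transition matrix here is an illustrative example (rows already include teleportation, so each row sums to 1 and a unique steady state a exists):

```python
def power_iteration(A, x, steps=100):
    """Repeatedly compute x <- xA (row vector times matrix) until stable.
    A is a row-stochastic transition matrix; x is a probability distribution."""
    n = len(A)
    for _ in range(steps):
        x = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
    return x

A = [[0.10, 0.80, 0.10],
     [0.45, 0.10, 0.45],
     [0.10, 0.80, 0.10]]

a = power_iteration(A, [1.0, 0.0, 0.0])  # start at page 0, x = (1, 0, 0)
print([round(v, 3) for v in a])  # [0.265, 0.471, 0.265]
```

The result is the same for any starting distribution, which is exactly the "regardless of where we start" claim on the slide.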
Pagerank summary:
- Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
- Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.

Sec 21.2.2
WHERE TO GO FROM HERE

Learning Objectives. You have learnt how to build your own search engine. In addition, you have picked up skills that are useful for your future:
- Python: one of the easiest and most straightforward programming languages to use
- NLTK: a good set of routines and data that are useful in dealing with NLP and IR
[Slides 103-104, "Opportunities in IR" (photo credits: http://www.flickr.com/photos/aperezdc): a map of related areas, including Adversarial IR, Boolean IR/VSM, Probabilistic IR, Computational Advertising, Recommendation Systems, Digital Libraries, Natural Language Processing, Question Answering, IR Theory, Distributed IR, Geographic IR, Social Network IR, and Query Analysis.]
Thanks for joining us for the journey!
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Context resemblance A simple measure of the similarity of a path cq in a query and a path
cq in a document is the following context resemblance function CR
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd iff we can transform cq into cd by inserting additional nodes
86
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
SimNoMerge example
87
ltc1tgt
query Inverted index
ltc1tgt
ltc2tgt
ltc3tgt
dictionary
ltd105gt
postings
ltd401gt ltd902gt
ltd2025gt ltd301gt ltd1209gt
ltd307gt ltd608gt ltd905gt
This example is slightly different from book
CR(c1c1) = 10
CR(c1c2) = 00
CR(c1c3) = 060
if wq = 10 then sim(qd9) = (10times10 x 02) + (06times10 x 05) = 5
All weights have been normalized
eg authorldquoBillrdquo
eg authorldquoBillrdquo
eg titleldquoBillrdquo
eg bookauthorfirstnameldquoBillrdquo Context Resemblance Query Term Weight Document Term Weight
OK to ignore Query vs ltc2 tgt since CR = 00
Query vs ltc1tgt Query vs ltc3tgt
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
INEX relevance assessments The relevance-coverage combinations are quantized as follows
This evaluation scheme takes account of the fact that binary relevance judgments are not appropriate for XML retrieval The quantization function Q instead allows us to grade each component as partially relevant The number of relevant components in a retrieved set A of components can then be computed as
88
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
92
Okapi BM25: A Nonbinary Model
If the query is long, we might also use similar weighting for query terms:
- tf_tq: term frequency in the query q
- k3: tuning parameter controlling term frequency scaling of the query
- No length normalization of queries (because retrieval is being done with respect to a single, fixed query)
The above tuning parameters should be set by optimization on a development test collection. Experiments have shown reasonable values for k1 and k3 to be between 1.2 and 2, and b = 0.75.
Sec. 11.4.3
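As a sketch, the document-side part of the BM25 weight those parameters plug into looks like this (using the simple idf = log(N/df_t); the full Okapi formula additionally multiplies in the query-side factor (k3 + 1) tf_tq / (k3 + tf_tq) for long queries):

```python
import math

def bm25_term_weight(tf, doc_len, avg_len, df, N, k1=1.5, b=0.75):
    # k1 controls document term-frequency saturation, b the degree of
    # length normalization; k1 in [1.2, 2] and b = 0.75 per the slide.
    idf = math.log(N / df)
    norm = k1 * ((1 - b) + b * doc_len / avg_len)
    return idf * (k1 + 1) * tf / (norm + tf)
```

The weight grows with tf but saturates (diminishing returns), unlike raw tf-idf.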
93
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way. The difference for probabilistic IR is that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Sec. 11.4.1
94
Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q). By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q):
- P(q) is the same for all documents, so we can ignore it.
- P(d) is the prior, often treated as the same for all d. But we can give a prior to "high-quality" documents, e.g., those with a high static quality score g(d) (cf. Section 7.1.4).
- P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
Sec. 12.2
95
Mixture model: Summary
What we model: the user has a document in mind and generates the query from this document. The equation represents the probability that the document the user had in mind was in fact this one:

P(d|q) ∝ P(d) · Π over t in q of ( (1 - λ) P(t|Mc) + λ P(t|Md) )

Sec. 12.2.2
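A minimal sketch of scoring with this mixture (maximum-likelihood estimates for P(t|Md) and P(t|Mc); the λ = 0.5 default and all names here are illustrative assumptions):

```python
def mixture_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    # Multiply, per query term, (1-lam)*P(t|Mc) + lam*P(t|Md):
    # the collection model Mc smooths terms the document never saw.
    score = 1.0
    for t in query:
        p_doc = doc_tf.get(t, 0) / doc_len
        p_coll = coll_tf.get(t, 0) / coll_len
        score *= (1 - lam) * p_coll + lam * p_doc
    return score
```

Without the collection component, a single unseen query term would zero out the whole product.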
96
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
Ch 20-21
97
Crawling
- Politeness: crawl only the allowed pages, and space out the requests to a site over a longer period (hours, days)
- Duplicates: need to integrate duplicate detection
- Scale: need to use a distributed approach
- Spam and spider traps: need to integrate spam detection
Sec. 20.1
98
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website, originally from 1994. Example:

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:

Important: cache the robots.txt file of each site we are crawling.
Sec. 20.2
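The slide's example can be exercised with Python's standard-library parser (the crawler name and URLs below are illustrative, not from the slide):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

# Every agent except "searchengine" is barred from /yoursite/temp/;
# an empty Disallow line means "allow everything" for that agent.
ok_other = rp.can_fetch("mycrawler", "http://example.com/yoursite/temp/p.html")
ok_se = rp.can_fetch("searchengine", "http://example.com/yoursite/temp/p.html")
```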
99
Pagerank algorithm
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say, x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps we're at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
Sec. 21.2.2
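The stated procedure is power iteration; a pure-Python sketch (the example transition matrix is invented for illustration):

```python
def steady_state(A, tol=1e-10, max_iter=1000):
    # Multiply x by increasing powers of A (x, xA, xA^2, ...) until the
    # product looks stable; rows of A must each sum to 1.
    n = len(A)
    x = [1.0] + [0.0] * (n - 1)  # start with any distribution
    for _ in range(max_iter):
        nxt = [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(p - q) for p, q in zip(x, nxt)) < tol:
            return nxt
        x = nxt
    return x

# Two-state chain whose steady state is a = (1/3, 2/3):
a = steady_state([[0.5, 0.5], [0.25, 0.75]])
```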
100
Pagerank summary
- Pre-processing: given the graph of links, build matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
- Query processing: retrieve pages meeting the query and rank them by their pagerank. The order is query-independent.
Sec. 21.2.2
101
WHERE TO GO FROM HERE
Learning Objectives
You have learnt how to build your own search engine. In addition, you have picked up skills that are useful for your future:
- Python: one of the easiest and most straightforward programming languages to use
- NLTK: a good set of routines and data that are useful in dealing with NLP and IR
102
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IR / VSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
104
Opportunities in IR
Thanks for joining us for the journey
105
CS3245 ndash Information Retrieval
Week 11 Probabilistic IRChapter 111 Probabilistic Approach to Retrieval
Basic Probability Theory2 Probability Ranking Principle3 Binary Independence Model BestMatch25 (Okapi)
Chapter 121 Language Models for IR
Ch 11-12
89
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
90
Binary Independence Model (BIM) Traditionally used with the PRP
Assumptions Binary (equivalent to Boolean) documents and
queries represented as binary term incidence vectors Eg document d represented by vector x = (x1 xM) where xt = 1 if term t occurs in d and xt = 0 otherwise Different documents may have the same vector representation
Independence no association between terms (not true but works in practice ndash naiumlve assumption)
rarr
Sec 113
90
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
So it boils down to computing RSV We can rank documents using the log oods ratios for the terms in the query 119862119862119905119905
The odds ratio is the ratio of two odds
(i) the odds of the term appearing if relevant ( 1199011199011199051199051minus119901119901119905119905
) and
(ii) the odds of the term appearing if nonrelevant ( 1199061199061199051199051minus119906119906119905119905
)
119888119888119905119905 = 0 if a term has equal odds of appearing in relevant and non-relevant documents and c119905119905 is positive if it is more likely to appear in relevant documents
119888119888119905119905 functions as a term weight so that
Operationally we sum 119888119888119905119905 quantities in accumulators for query terms appearing in documents just like in VSM
91
Deriving a Ranking Function for Query Terms
Information Retrieval
Sec 1131
91
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Okapi BM25 A Nonbinary Model If the query is long we might also use similar weighting for query
terms
tftq term frequency in the query q k3 tuning parameter controlling term frequency scaling of the
query No length normalization of queries
(because retrieval is being done with respect to a single fixed query) The above tuning parameters should be set by optimization on a
development test collection Experiments have shown reasonable values for k1 and k3 as values between 12 and 2 and b = 075
Sec 1143
92
CS3245 – Information Retrieval
An Appraisal of Probabilistic Models
The difference between 'vector space' and 'probabilistic' IR is not that great: in either case, you build an information retrieval scheme in the exact same way.
The difference for probabilistic IR: at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory.
Sec. 11.4.1
93
CS3245 – Information Retrieval
Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q):
P(q) is the same for all documents, so it can be ignored.
P(d) is the prior – often treated as the same for all d. But we can give a higher prior to "high-quality" documents, e.g. those with a high static quality score g(d) (cf. Section 7.1.4).
P(q|d) is the probability of q given d.
So, to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.
94
Sec. 12.2
CS3245 – Information Retrieval
Mixture model: Summary
What we model: the user has a document in mind and generates the query from this document.
The equation represents the probability that the document the user had in mind was in fact this one.
95
Sec. 12.2.2
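The mixture-model ranking can be sketched as query likelihood under a Jelinek-Mercer mixture of the document and collection language models, i.e. P(q|d) = Π_t (λ P(t|M_d) + (1−λ) P(t|M_c)). The function name, the tokenized-list inputs, and λ = 0.5 are illustrative assumptions, not values from the slides.

```python
import math

def query_likelihood(query, doc, collection, lam=0.5):
    """Score log P(q|d) under a Jelinek-Mercer mixture of the document
    language model and the collection language model (lam = mixture weight).
    query/doc/collection are lists of tokens."""
    doc_len = len(doc)
    coll_len = len(collection)
    score = 0.0
    for t in query:
        p_doc = doc.count(t) / doc_len            # MLE estimate of P(t | M_d)
        p_coll = collection.count(t) / coll_len   # MLE estimate of P(t | M_c)
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0:
            return float('-inf')  # term unseen even in the collection
        score += math.log(p)
    return score
```

Smoothing with the collection model is what keeps a single unseen query term from zeroing out an otherwise good document's score.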
CS3245 – Information Retrieval
Week 12: Web Search
Chapter 20: Crawling
Chapter 21: Link Analysis
96
Ch. 20-21
CS3245 – Information Retrieval
Crawling
Politeness: crawl only the allowed pages, and space out the requests to a site over a longer period (hours, days).
Duplicates: need to integrate duplicate detection.
Scale: need to use a distributed approach.
Spam and spider traps: need to integrate spam detection.
Information Retrieval 97
Sec. 20.1
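The politeness requirement – spacing out requests to the same host – can be sketched as a small rate limiter. The function name, the module-level dict, and the 30-second default delay are illustrative assumptions; real crawlers pick delays per site policy.

```python
import time
from urllib.parse import urlparse

last_fetch = {}  # host -> monotonic time of the last request to that host

def polite_wait(url, min_delay=30.0):
    """Block until at least min_delay seconds have passed since the
    last request to url's host, then record this request."""
    host = urlparse(url).netloc
    now = time.monotonic()
    if host in last_fetch:
        elapsed = now - last_fetch[host]
        if elapsed < min_delay:
            time.sleep(min_delay - elapsed)  # space out same-host requests
    last_fetch[host] = time.monotonic()
```

A crawler thread would call `polite_wait(url)` immediately before each fetch; requests to different hosts are not delayed by each other.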
CS3245 – Information Retrieval
Robots.txt
Protocol for giving crawlers ("robots") limited access to a website; originally from 1994. Example:
User-agent: *
Disallow: /yoursite/temp/
User-agent: searchengine
Disallow:
Important: cache the robots.txt file of each site we are crawling.
Information Retrieval 98
Sec. 20.2
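Python's standard library can parse this kind of robots.txt directly. The sketch below feeds the example rules to `urllib.robotparser`; the crawler name "mycrawler" and the example.com URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Parse the example rules: "*" is denied /yoursite/temp/, while the
# empty Disallow grants "searchengine" access to everything.
rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines())

blocked = rp.can_fetch("mycrawler", "http://example.com/yoursite/temp/a.html")
allowed = rp.can_fetch("searchengine", "http://example.com/yoursite/temp/a.html")
```

In a real crawler, the parsed `rp` object would be the cached per-site copy the slide recommends, consulted before every fetch.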
CS3245 – Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec. 21.2.2
Regardless of where we start, we eventually reach the steady state a:
1. Start with any distribution (say x = (1, 0, ..., 0)).
2. After one step, we're at xA.
3. After two steps, at xA^2, then xA^3, and so on.
4. "Eventually" means: for large k, xA^k = a.
Algorithm: multiply x by increasing powers of A until the product looks stable.
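The steps above can be sketched as power iteration in plain Python. The 3-page transition matrix A below is a made-up example (every entry positive, so the chain converges); only the loop structure follows the slide's algorithm.

```python
# Made-up 3-page row-stochastic transition matrix (illustrative only).
A = [[0.1, 0.8, 0.1],
     [0.4, 0.2, 0.4],
     [0.1, 0.1, 0.8]]

def step(x, A):
    """One step of the chain: returns the vector xA."""
    n = len(x)
    return [sum(x[i] * A[i][j] for i in range(n)) for j in range(n)]

x = [1.0, 0.0, 0.0]  # start with any distribution, here (1, 0, 0)
for _ in range(1000):
    x_next = step(x, A)
    if max(abs(u - v) for u, v in zip(x_next, x)) < 1e-12:
        break  # the product looks stable: x is (numerically) the steady state a
    x = x_next
```

At convergence x satisfies x = xA, i.e. it is the steady-state distribution a, independent of the starting vector.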
CS3245 – Information Retrieval
100
Pagerank summary
Pre-processing: given the graph of links, build the matrix A; from it, compute a. The pagerank a_i is a scaled number between 0 and 1.
Query processing: retrieve pages meeting the query and rank them by their pagerank. This order is query-independent.
Sec. 21.2.2
CS3245 – Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 – Information Retrieval
Learning Objectives
You have learnt how to build your own search engine.
In addition, you have picked up skills that are useful for your future:
Python – one of the easiest and most straightforward programming languages to use
NLTK – a good set of routines and data that are useful in dealing with NLP and IR
102
CS3245 – Information Retrieval
103
Photo credits: http://www.flickr.com/photos/aperezdc
Adversarial IR
Boolean IR / VSM
Prob. IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 – Information Retrieval
104
Opportunities in IR
CS3245 – Information Retrieval
Thanks for joining us for the journey!
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
An Appraisal of Probabilistic Models The difference between lsquovector spacersquo and
lsquoprobabilisticrsquo IR is not that great In either case you build an information retrieval scheme in
the exact same way Difference for probabilistic IR at the end you score
queries not by cosine similarity and tf-idf in a vector space but by a slightly different formula motivated by probability theory
Sec 1141
93
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Using language models in IR Each document is treated as (the basis for) a language
model Given a query q rank documents based on P(d|q)
P(q) is the same for all documents so ignore P(d) is the prior ndash often treated as the same for all d
But we can give a prior to ldquohigh-qualityrdquo documents eg those with high static quality score g(d) (cf Section 714)
P(q|d) is the probability of q given d So to rank documents according to relevance to q
ranking according to P(q|d) and P(d|q) is equivalent
94
Sec 122
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Mixture model Summary
What we model The user has a document in mind and generates the query from this document
The equation represents the probability that the document that the user had in mind was in fact this one
95
Sec 1222
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Week 12 Web SearchChapter 20 Crawling
Chapter 21 Link Analysis
96
Ch 20-21
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs IR in general
- Without search engines the web wouldnrsquot work
- Interest aggregation
- SEARCHADVERTISING
- 1st Generation of Search AdsGoto (1996)
- 1st Generation of search adsGoto (1996)
- 2nd generation of search ads Google (2000)
- Do ads influence editorial content
- How are ads ranked
- How are ads ranked
- Googlersquos second price auction
- Googlersquos second price auction
- Keywords with high bids
- Search ads A win-win-win
- Not a win-win-win Keyword arbitrage
- Not a win-win-win Violation of trademarks
- DUPLICATEDETECTION
- Duplicate detection
- Near-duplicates Example
- Detecting near-duplicates
- Recall Jaccard coefficient
- Jaccard coefficient Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(di)s(dj)) cong P(xi = xj)
- Estimating Jaccard
- Shingling Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1 Ngram Models
- Unigram LM with Add 1 smoothing
- Week 2 Basic (Boolean) IR
- Indexer steps Dictionary amp Postings
- The merge
- Week 3 Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5 Index construction
- BSBI Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI Single-pass in-memory indexing
- Distributed Indexing MapReduce Data flow
- Dynamic Indexing 2nd simplest approach
- Logarithmic merge
- Week 6 Index Compression
- Index CompressionDictionary-as-a-String and Blocking
- Front coding
- Postings CompressionPostings file entry
- Variable Byte Encoding Example
- Week 7 Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8 Complete Search Systems
- Recap Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9 IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- AB testing
- Week 10 Relevance Feedback Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11 Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25 A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model Summary
- Week 12 Web Search
- Crawling
- Robotstxt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105
-
CS3245 ndash Information Retrieval
Crawling Politeness need to crawl only the allowed pages and
space out the requests for a site over a longer period (hours days)
Duplicates need to integrate duplicate detection
Scale need to use a distributed approach
Spam and spider traps need to integrate spam detection
Information Retrieval 97
Sec 201
CS3245 ndash Information Retrieval
Robotstxt Protocol for giving crawlers (ldquorobotsrdquo) limited access
to a website originally from 1994 Example
User-agent Disallow yoursitetemp
User-agent searchengine Disallow
Important cache the robotstxt file of each site we are crawling
Information Retrieval 98
Sec 202
CS3245 ndash Information Retrieval
99
Pagerank algorithm
Information Retrieval
Sec 2122
Regardless of where we start we eventually reach the steady state a1 Start with any distribution (say x = (1 0 hellip 0))2 After one step were at xA3 After two steps at xA2 then xA3 and so on4 Eventually means for large k xAk = a
Algorithm multiply x by increasing powers of A until the product looks stable
CS3245 ndash Information Retrieval
100
Pagerank summary Pre-processing Given graph of links build matrix A From it compute a The pagerank ai is a scaled number between 0 and 1
Query processing Retrieve pages meeting query Rank them by their pagerank Order is query-independent
Sec 2122
CS3245 ndash Information Retrieval
101
WHERE TO GO FROM HERE
CS3245 ndash Information Retrieval
Learning Objectives You have learnt how to build your own search
engine
In addition you have picked up skills that are useful for your future Python ndash one of the easiest and more straightforward
programming languages to use NLTK ndash A good set of routines and data that are useful in
dealing with NLP and IR
102
CS3245 ndash Information Retrieval
103Photo credits httpwwwflickrcomphotosaperezdc
Adversarial IR
Boolean IRVSM
Prob IR
Computational Advertising
Recommendation Systems
Digital Libraries
Natural Language Processing
Question Answering
IR Theory
Distributed IR
Geographic IR
Social Network IR
Query Analysis
CS3245 ndash Information Retrieval
104
Opportunities in IR
CS3245 ndash Information Retrieval
Thanks for joining us for the journey
105
- Slide Number 1
- Last Time
- Today
- Web search overview
- IR on the web vs. IR in general
- Without search engines the web wouldn't work
- Interest aggregation
- SEARCH ADVERTISING
- 1st Generation of Search Ads: Goto (1996)
- 1st Generation of search ads: Goto (1996)
- 2nd generation of search ads: Google (2000)
- Do ads influence editorial content?
- How are ads ranked?
- How are ads ranked?
- Google's second price auction
- Google's second price auction
- Keywords with high bids
- Search ads: A win-win-win
- Not a win-win-win: Keyword arbitrage
- Not a win-win-win: Violation of trademarks
- DUPLICATE DETECTION
- Duplicate detection
- Near-duplicates: Example
- Detecting near-duplicates
- Recall: Jaccard coefficient
- Jaccard coefficient: Example
- A document as set of shingles
- Fingerprinting
- Documents as sketches
- Slide Number 30
- Proof that J(S(d_i), S(d_j)) ≅ P(x_i = x_j)
- Estimating Jaccard
- Shingling: Summary
- Summary
- EXAM MATTERS
- Slide Number 36
- Exam Format
- Exam Topics
- Exam Topic Distribution
- REVISION
- Week 1: N-gram Models
- Unigram LM with Add-1 smoothing
- Week 2: Basic (Boolean) IR
- Indexer steps: Dictionary & Postings
- The merge
- Week 3: Postings lists and Choosing terms
- Adding skip pointers to postings
- Biword indexes
- Positional index example
- Choosing terms
- Slide Number 51
- Hash Table
- Wildcard queries
- Spelling correction
- Week 5: Index construction
- BSBI: Blocked sort-based Indexing (Sorting with fewer disk seeks)
- SPIMI: Single-pass in-memory indexing
- Distributed Indexing: MapReduce Data flow
- Dynamic Indexing: 2nd simplest approach
- Logarithmic merge
- Week 6: Index Compression
- Index Compression: Dictionary-as-a-String and Blocking
- Front coding
- Postings Compression: Postings file entry
- Variable Byte Encoding Example
- Week 7: Vector space ranking
- tf-idf weighting
- Queries as vectors
- Cosine for length-normalized vectors
- Slide Number 70
- Week 8: Complete Search Systems
- Recap: Computing cosine scores
- Generic approach
- Heuristics
- Net score
- Parametric Indices
- Week 9: IR Evaluation
- A precision-recall curve
- Kappa measure for inter-judge (dis)agreement
- A/B testing
- Week 10: Relevance Feedback, Query Expansion and XML IR
- Rocchio (1971)
- Query Expansion
- Co-occurrence Thesaurus
- Main idea: lexicalized subtrees
- Context resemblance
- SimNoMerge example
- INEX relevance assessments
- Week 11: Probabilistic IR
- Binary Independence Model (BIM)
- Deriving a Ranking Function for Query Terms
- Okapi BM25: A Nonbinary Model
- An Appraisal of Probabilistic Models
- Using language models in IR
- Mixture model: Summary
- Week 12: Web Search
- Crawling
- Robots.txt
- Pagerank algorithm
- Pagerank summary
- WHERE TO GO FROM HERE
- Learning Objectives
- Slide Number 103
- Slide Number 104
- Slide Number 105