Implementation Issues. Arjen P. de Vries, Centrum voor Wiskunde en Informatica, Amsterdam. arjen@acm.org


Page 1: Implementation Issues Arjen P. de Vries Centrum voor Wiskunde en Informatica Amsterdam arjen@acm.org

Implementation Issues

Arjen P. de Vries, Centrum voor Wiskunde en Informatica

arjen@acm.org

Page 2:

Overview

• Indexing collections
• Some text statistics
• File organization
  – Inverted files
  – Signature files
• Search engines
• Indexing XML

Page 3:

Supporting the Search Process

[Diagram: the search process cycle. Source Selection leads to Query Formulation; the query goes to the IR System (Search), which returns a Ranked List; Selection, Examination, and Document Delivery follow. Feedback loops cover Query Reformulation and Relevance Feedback, and Source Reselection. The user's roles are to nominate, predict, and choose.]

Page 4:

Supporting the Search Process

[Diagram: the same search process, now showing the system side: Acquisition builds the Collection, and Indexing builds the Index that the IR System searches.]

Page 5:

Some Questions for Today

• How long will it take to find a document?
  – Is there any work we can do in advance?
    • If so, how long will that take?
• How big a computer will I need?
  – How much disk space? How much RAM?
• What if more documents arrive?
  – How much of the advance work must be repeated?
  – Will searching become slower?
  – How much more disk space will be needed?

Page 6:

Text Statistics

Page 7:

Frequent words on the WWW

65002930 the; 62789720 a; 60857930 to; 57248022 of; 54078359 and; 52928506 in; 50686940 s; 49986064 for; 45999001 on; 42205245 this; 41203451 is; 39779377 by; 35439894 with; 35284151 or; 34446866 at; 33528897 all; 31583607 are; 30998255 from; 30755410 e; 30080013 you; 29669506 be; 29417504 that; 28542378 not; 28162417 an; 28110383 as; 28076530 home; 27650474 it; 27572533 i; 24548796 have; 24420453 if; 24376758 new; 24171603 t; 23951805 your; 23875218 page; 22292805 about; 22265579 com; 22107392 information; 21647927 will; 21368265 can; 21367950 more; 21102223 has; 20621335 no; 19898015 other; 19689603 one; 19613061 c; 19394862 d; 19279458 m; 19199145 was; 19075253 copyright; 18636563 us

(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)

Page 8:

Plotting Word Frequency by Rank

• Main idea: count how many times tokens occur in the text, summed over all of the texts in the collection
• Now order these tokens according to how often they occur (highest to lowest)
• This is called the rank

Page 9:

Zipf Distribution

• The product of the frequency of words (f) and their rank (r) is approximately constant
  – Rank = order of words' frequency of occurrence
• Another way to state this is with an approximately correct rule of thumb:
  – Say the most common term occurs C times
  – The second most common occurs C/2 times
  – The third most common occurs C/3 times

f ≈ C · (1/r), with C ≈ N/10 (N = the number of word tokens in the collection)
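The rule of thumb can be expressed directly; C = 60000 here is an arbitrary illustrative constant, not a figure from the slide:

```python
def zipf_frequency(rank, c):
    """Predicted frequency of the rank-th most common word: f = C / r."""
    return c / rank

c = 60000  # hypothetical count of the most common term
predicted = [round(zipf_frequency(r, c)) for r in range(1, 6)]
# the second most common term occurs about C/2 times, the third C/3, ...
print(predicted)  # [60000, 30000, 20000, 15000, 12000]
```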

Page 10:

Zipf Distribution

[Illustration by Jakob Nielsen: the Zipf curve plotted on a linear scale and on a logarithmic scale.]

Page 11:

What has a Zipf Distribution?

• Words in a text collection
  – Virtually any use of natural language
• Library book checkout patterns
• Incoming web page requests (Nielsen)
• Outgoing web page requests (Cunha & Crovella)
• Document size on the web (Cunha & Crovella)

Page 12:

Related Distributions/Laws

• For more examples, see Robert A. Fairthorne, "Empirical Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction," Journal of Documentation, 25(4), 319-341
• Pareto distribution of wealth; Willis taxonomic distribution in biology; Mandelbrot on self-similarity, market prices, and communication errors; etc.

Page 13:

Consequences of Zipf

• There are always a few very frequent tokens that are not good discriminators.
  – Called "stop words" in IR
  – Usually correspond to the linguistic notion of "closed-class" words
    • English examples: to, from, on, and, the, ...
    • Grammatical classes that don't take on new members
• There are always a large number of tokens that occur once (and can have unexpected consequences for some IR algorithms)
• Medium-frequency words are the most descriptive

Page 14:

Word Frequency vs. Resolving Power

The most frequent words are not the most descriptive. (from Van Rijsbergen 79)

Page 15:

File Organization

Page 16:

File Structures for IR

• Lexicographical indices (indices that are sorted)
  – inverted files
  – Patricia (PAT) trees (suffix trees and arrays)
• Indices based on hashing
  – signature files

Page 17:

Document Vectors

ID: nova, galaxy, heat, h'wood, film, role, diet, fur
A: 10, 5, 3
B: 5, 10
C: 10, 8, 7
D: 9, 10, 5
E: 10, 10
F: 9, 10
G: 5, 7, 9
H: 6, 10, 2, 8
I: 7, 5, 1, 3

"Nova" occurs 10 times in text A; "Galaxy" occurs 5 times in text A; "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)

Page 18:

Document Vectors

(same table as Page 17)

"Hollywood" occurs 7 times in text I; "Film" occurs 5 times in text I; "Diet" occurs 1 time in text I; "Fur" occurs 3 times in text I.

Page 19:

Document Vectors

(same table as Page 17)

Page 20:

Inverted Index

• The primary data structure for text indexes
• Main idea: invert documents into a big index
• Basic steps:
  – Make a "dictionary" of all the tokens in the collection
  – For each token, list all the docs it occurs in
  – Do a few things to reduce redundancy in the data structure

Page 21:

Inverted Indexes

• An inverted file is a vector file "inverted" so that rows become columns and columns become rows

docs  t1 t2 t3
D1    1  0  1
D2    1  0  0
D3    0  1  1
D4    1  0  0
D5    1  1  1
D6    1  1  0
D7    0  1  0
D8    0  1  0
D9    0  0  1
D10   0  1  1

Terms  D1 D2 D3 D4 D5 D6 D7 …
t1     1  1  0  1  1  1  0
t2     0  0  1  0  1  1  1
t3     1  0  1  0  1  0  0

Page 22:

How Inverted Files Are Created

• Documents are parsed to extract tokens
• Save each token with its Document ID

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term/Doc# pairs: now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2

Page 23:

What are the tokens?

• Stemming?
• Case folding?
• Thesauri and/or soundex?
• Spelling correction?

Page 24:

How Inverted Files are Created

• After all documents have been parsed, the inverted file is sorted alphabetically

Sorted Term/Doc# pairs: a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

Page 25:

How Inverted Files are Created

• Multiple term entries for a single document are merged
• Within-document term frequency information is compiled

Term, Doc#, Freq: a 2 1; aid 1 1; all 1 1; and 2 1; come 1 1; country 1 1; country 2 1; dark 2 1; for 1 1; good 1 1; in 2 1; is 1 1; it 2 1; manor 2 1; men 1 1; midnight 2 1; night 2 1; now 1 1; of 1 1; past 2 1; stormy 2 1; the 1 2; the 2 2; their 1 1; time 1 1; time 2 1; to 1 2; was 2 2

Page 26:

How Inverted Files are Created

The file is split into a Dictionary and a Postings file.

Dictionary (Term, N docs, Tot Freq): a 1 1; aid 1 1; all 1 1; and 1 1; come 1 1; country 2 2; dark 1 1; for 1 1; good 1 1; in 1 1; is 1 1; it 1 1; manor 1 1; men 1 1; midnight 1 1; night 1 1; now 1 1; of 1 1; past 1 1; stormy 1 1; the 2 4; their 1 1; time 2 2; to 1 2; was 1 2

Postings (Doc#, Freq): 2 1; 1 1; 1 1; 2 1; 1 1; 1 1; 2 1; 2 1; 1 1; 1 1; 2 1; 1 1; 2 1; 2 1; 1 1; 2 1; 2 1; 1 1; 1 1; 2 1; 2 1; 1 2; 2 2; 1 1; 1 1; 2 1; 1 2; 2 2
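The pipeline on Pages 22-26 (parse into (term, doc) pairs, merge duplicates, compile within-document frequencies, sort) can be sketched in a few lines of Python; a toy illustration, not the slides' actual implementation:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, within_doc_freq), ...]}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():   # parse: extract tokens
            index[token][doc_id] += 1        # merge duplicates, count freq
    # sort alphabetically by term and by doc id, as on the slides
    return {t: sorted(d.items()) for t, d in sorted(index.items())}

docs = {1: "now is the time for all good men to come to the aid of their country",
        2: "it was a dark and stormy night in the country manor "
           "the time was past midnight"}
index = build_inverted_index(docs)
# a dictionary entry for a term: number of docs and total frequency
n_docs, tot_freq = len(index["the"]), sum(f for _, f in index["the"])
```

Running this reproduces the slide's dictionary entry for "the": it appears in 2 documents with total frequency 4.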

Page 27:

Searching

Three general steps:

• Vocabulary search
  – Identify the words and patterns in the query
  – Search for them in the vocabulary
• Retrieval of occurrences
  – Retrieve the lists of occurrences of all the words
• Manipulation of occurrences
  – Solve phrases, proximity, or Boolean operations
  – Find the exact word positions when block addressing is used

Page 28:

How Inverted Files are Used

(using the Dictionary and Postings from Page 26)

Boolean query: "time" AND "dark"

2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file
1 doc with "dark" in the dictionary -> ID 2 from the postings file

Therefore, only doc 2 satisfies the query.
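The posting-list intersection behind this AND query is the standard merge walk over two sorted doc-id lists; a minimal sketch:

```python
def intersect(p1, p2):
    """AND two sorted doc-id lists with a linear merge walk."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# "time" appears in docs 1 and 2; "dark" only in doc 2
print(intersect([1, 2], [2]))  # [2]
```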

Page 29:

Query optimization

• Consider a query that is an AND of t terms.
• The idea: for each of the t terms, get its term-doc incidence from the postings, then AND them together.
• Process in order of increasing frequency:
  – start with the smallest set, then keep cutting it further.

Keep tf in the dictionary!
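A sketch of this optimization, assuming a toy index that maps terms to doc-id lists (the term and document names are illustrative):

```python
def conjunctive_query(terms, index):
    """AND together postings lists, rarest term first, so intermediate
    result sets stay as small as possible."""
    postings = sorted((set(index[t]) for t in terms), key=len)
    result = postings[0]          # start with the smallest set
    for p in postings[1:]:
        result &= p               # keep cutting the candidate set further
        if not result:            # can stop early once the set is empty
            break
    return sorted(result)

index = {"time": [1, 2], "dark": [2], "the": [1, 2]}
print(conjunctive_query(["the", "time", "dark"], index))  # [2]
```

Sorting by list length is why the frequency counts belong in the dictionary: they let the engine pick the rarest term without touching its postings.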

Page 30:

Processing for ranking systems

For each query term
    for each document in the inverted list
        augment the similarity coefficient
For each document
    finish the calculation of the similarity coefficient
Sort the similarity coefficients
Retrieve and present the documents

Page 31:

Processing Vector Space Queries (I)

Set A <- {}  (A is the set of accumulators)
For each query term t in Q:
    Stem t
    Search the lexicon; record f_t and the address of I_t
    Set w_t <- 1 + log_e(N / f_t)

Page 32:

Processing VS Queries (II)

    Read the inverted file entry, I_t
    For each (d, f_d,t) pair in I_t:
        If A_d not in A then
            set A_d <- 0
            set A <- A + {A_d}
        Set A_d <- A_d + log_e(1 + f_d,t) * w_t
For each A_d in A:
    Set A_d <- A_d / W_d  (where W_d is the weight of document d)
    A_d is now proportional to the value cosine(Q, D_d)
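The pseudocode on Pages 31-32 translates to the following accumulator loop; a sketch under assumed data layouts (index maps a term to its (doc id, f_d,t) postings, doc_weights holds the W_d values, and stemming is omitted):

```python
import math

def rank_documents(query_terms, index, doc_weights, n_docs):
    """Accumulator-based cosine ranking, following the slides' pseudocode."""
    acc = {}                                       # A: the set of accumulators
    for t in query_terms:
        postings = index.get(t, [])                # inverted file entry I_t
        if not postings:
            continue
        w_t = 1 + math.log(n_docs / len(postings))         # w_t = 1 + ln(N/f_t)
        for d, f_dt in postings:
            acc[d] = acc.get(d, 0.0) + math.log(1 + f_dt) * w_t
    for d in acc:
        acc[d] /= doc_weights[d]                   # normalise by W_d
    return sorted(acc.items(), key=lambda kv: -kv[1])

index = {"time": [(1, 1), (2, 1)], "dark": [(2, 1)]}
ranking = rank_documents(["time", "dark"], index, {1: 1.0, 2: 1.0}, n_docs=2)
```

With equal document weights, doc 2 outranks doc 1 because it matches the rarer term "dark" as well.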

Page 33:

How Large is the Dictionary?

[Plot: number of unique terms vs. number of documents, growing sublinearly toward roughly 1.0e6-2.0e6 terms.]

• Heaps' law: the vocabulary grows as O(n^β), with β ≈ 0.4-0.6
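A quick sketch of Heaps' law as a vocabulary-size estimate; k = 44 and β = 0.5 are illustrative values in the typical range, not figures from the slide:

```python
def heaps_vocabulary(n_tokens, k=44, beta=0.5):
    """Heaps' law: V = k * n^beta unique terms after n tokens."""
    return round(k * n_tokens ** beta)

# with beta = 0.5, quadrupling the collection only doubles the vocabulary
print(heaps_vocabulary(1_000_000))   # 44000
print(heaps_vocabulary(4_000_000))   # 88000
```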

Page 34:

Term Storage in Dictionary

• Store pointers to every kth term string.
• Need to store term lengths (1 extra byte):
  ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….

[Table: dictionary entries with columns Freq., Postings ptr., Term ptr.; example frequencies 33, 29, 44, 126, 7, with a term pointer only on every kth entry.]

Save 9 bytes on 3 pointers; lose 4 bytes on term lengths.

Page 35:

Searching the Dictionary

• Sorted arrays
  – store the list of keywords in a sorted array
  – use standard binary search
  – advantage: easy to implement
  – disadvantage: updating the index is expensive
• Hashing structures
• B-trees, tries, Pat(ricia) trees, …
• Combinations of these structures

Page 36:

The lexicon as a Hash Table

• Store terms in a hash table:
  – Map the M terms to a set of m integer values, 0 <= h(t_i) <= m-1 (m << M).
  – A simple hash function for strings:
    h(t) = ( Σ_{i=1..|t|} w_i · t_i ) mod m
  – Retrieval time: O(|t|)
  – Cons:
    • collisions vs. space
    • cannot handle wildcards
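A string hash of the kind the slide describes can be sketched as follows; the position-based weights w_i are an illustrative choice, since the slide does not pin them down:

```python
def hash_term(term, m):
    """h(t) = (sum_i w_i * t_i) mod m, with the position index as weight
    and the character code as t_i. One pass over the term, so O(|t|)."""
    return sum(i * ord(ch) for i, ch in enumerate(term, start=1)) % m

m = 101  # table size, m << M (the number of distinct terms)
print(0 <= hash_term("cold", m) < m)  # True
```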

Page 37:

Tries

• A structure constructed from strings
• Name comes from the word "retrieval"
• Used for string search in normal text
• Each edge represents a character of the string, including the terminator $
• Leaves contain the keys

Page 38:

TRIE - Non-compact

Consider 5 character strings: BIG, BIGGER, BILL, GOOD, GONG

Each character is stored on one edge of the tree.

[Diagram: a trie whose root branches on B and G. The B branch spells B-I, then splits: G leads to the leaf BIG$ and, continuing G-E-R, to BIGGER$; L-L leads to BILL$. The G branch spells G-O, then splits: O-D leads to GOOD$ and N-G to GONG$. Every leaf is reached through a $ edge.]

Page 39:

TRIE - Properties

• An internal node can have from 1 to d children, where d is the size of the alphabet.
• A path from the root of T to an internal node i corresponds to an i-character prefix of a string S.
• The height of the tree is the length of the longest string.
• If there are S unique strings, T has S external nodes.
• Looking up a string of length M is O(M).
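A non-compact trie with the $ terminator, as on the previous slides, can be sketched like this (toy code, not a space-efficient implementation):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # one edge per character

def trie_insert(root, word):
    node = root
    for ch in word + "$":          # "$" terminates the string, as on the slide
        node = node.children.setdefault(ch, TrieNode())

def trie_contains(root, word):
    """O(M) lookup for a word of length M: follow one edge per character."""
    node = root
    for ch in word + "$":
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

root = TrieNode()
for w in ["BIG", "BIGGER", "BILL", "GOOD", "GONG"]:
    trie_insert(root, w)
```

Because of the terminator, prefixes of stored words (like "BI") are correctly rejected.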

Page 40:

Compact Trie

Remove chains leading to leaves.

[Diagram: the trie from Page 38, with each chain of single-child nodes above a leaf collapsed into one edge, so the leaves BIG$, BIGGER$, BILL$, GOOD$, and GONG$ hang directly off the last branching node.]

Page 41:

PATRICIA (PAT Tree)

Practical Algorithm To Retrieve Information Coded In Alphabets, introduced by D.R. Morrison in October 1968.

Collapse all unary nodes.

[Diagram: the compact trie from Page 40 with all remaining unary nodes collapsed, leaving edges labelled BI and GO at the root and the leaves BIG$, BIGGER$, BILL$, GOOD$, GONG$.]

Page 42:

PATRICIA - Comments

• Number of nodes is proportional to the number of strings
• Strings on edges become variable length
• Use an auxiliary data structure to actually store the strings
• The TRIE only stores triplets of numbers indicating where in the auxiliary data structure to look

Page 43:

Partially specified query (*)

• Using n-grams
  – E.g., decompose labor into la-ab-bo-or
  – Mark begin/end of word with a special char ($)
  – labor -> $l-la-ab-bo-or-r$
• Answers to lab*r are obtained via: $l AND la AND ab AND r$
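The $-padded bigram decomposition can be sketched directly:

```python
def bigrams(term):
    """Decompose a term into bigrams, with $ marking begin and end of word."""
    padded = "$" + term + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

print(bigrams("labor"))  # ['$l', 'la', 'ab', 'bo', 'or', 'r$']
```

A wildcard query such as lab*r is then answered by ANDing the bigram postings of its fixed parts ($l, la, ab, r$).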

Page 44:

Use of bigrams

Number  Term
1       abhor
2       bear
3       laaber
4       labor
5       laborator
6       labour
7       lavacaber
8       slab

Bigram  Term ids
$a      1
$b      2
$l      3,4,5,6,7
$s      8
aa      3
ab      1,3,4,5,6,7,8
bo      4,5,6
la      3,4,5,6,7,8
or      1,4,5
ou      6
ra      5
ry      5
r$      1,2,3,4,5,6,7
sl      8

Page 45:

Problems with bi-grams

• The query $l AND la AND ab AND r$ might produce false matches:
  – In the example, it will match laaber and lavacaber (from the TREC collection)
  – Why: all the bigrams are present, but not consecutive
• What about labrador?
  – Right answer, but not what was intended
  – Wildcards with n-grams are risky!
• False matches can be checked with an additional exact pattern matcher

Page 46:

Can we support wildcards with a TRIE?

• Easy to answer queries with wildcard tails
  – E.g., lab*, labor*, …
• What about lab*r or *bour?
• Use a rotated lexicon
  – Store: $labor, abor$l, bor$la, or$lab, r$labo, labor$
• To search for lab*r:
  – rotate until the wildcard is at the end: r$lab*, then
  – search for strings that have this prefix

Page 47:

Rotated Lexicon

id  Term
1   abhor
2   bear
3   laaber
4   labor
5   laborator
6   labour
7   lavacaber
8   slab

Rotated      Add (term id, offset)
$abhor       (1,0)
$bear        (2,0)
$laaber      (3,0)
$labor       (4,0)
$laborator   (5,0)
$labour      (6,0)
$lavacaber   (7,0)
$slab        (8,0)
aaber$l      (3,2)
abhor$       (1,1)
aber$la      (3,3)
abor$l       (4,2)
aborator$l   (5,2)
abour$l      (6,2)
aber$lavac   (7,6)
ab$sl        (8,3)
r$abho       (1,5)
r$bea        (2,4)
r$laabe      (3,6)
r$labo       (4,5)
r$laborato   (5,9)
r$labou      (6,6)
r$lavacabe   (7,9)
slab$        (8,1)

Page 48:

What Goes in a Postings File?

• Boolean retrieval
  – Just the document number
• Ranked retrieval
  – Document number and term weight (TF*IDF, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
  – Example: Doc 3 (t17, t36), Doc 13 (t3, t45)

Page 49:

How Big Is the Postings File?

• Very compact for Boolean retrieval
  – About 10% of the size of the documents, if an aggressive stopword list is used!
• Not much larger for ranked retrieval
  – Perhaps 20%
• Enormous for proximity operators
  – Sometimes larger than the documents!

Page 50:

Indexing: index size as a percentage of the collection size

                       Small (1 MB)    Medium (200 MB)    Large (2 GB)
Addressing words        45%   73%       36%   64%          35%   63%
Addressing documents    19%   26%       18%   32%          26%   47%
Addressing 64k blocks   27%   41%       18%   32%           5%    9%
Addressing 256 blocks   18%   25%      1.7%  2.4%         0.5%  0.7%

Notes:
• Addressing words: full inversion (all words, exact positions, 4-byte pointers)
• Addressing documents: document size 10KB; 1, 2, or 3 bytes per pointer, depending on text size
• Addressing blocks: 2 or 1 byte(s) per pointer, independent of the text size
• The two figures in each cell correspond to the two indexing policies: stop words not indexed vs. all words indexed.

Page 51:

Inverted File Compression

• Key idea: each postings list can be stored as an ascending sequence of integers
• Example: if t occurs in 8 documents, list the postings in ascending doc id
  – <8; 3, 5, 20, 21, 23, 76, 77, 78>
• Storing the gaps changes this to <8; 3, 2, 15, 1, 2, 53, 1, 1>

Page 52:

Inverted File Compression

• More formally, if t appears in n_t docs:
  – <n_t; d_1, d_2, …, d_{n_t}>, where d_k < d_{k+1}
• The postings can be equivalently stored as an initial position followed by a list of d-gaps: d_{k+1} - d_k
• Example
  – From <8; 3, 5, 20, 21, 23, 76, 77, 78>
  – To <8; 3, 2, 15, 1, 2, 53, 1, 1>

Page 53:

Inverted File Compression

• No information loss, but … where does the saving come from?
  – The largest d-gap in the second representation is potentially the same as the largest document id in the first
  – If there are N documents, and a fixed binary encoding is used, both methods require log N bits per stored pointer…

Page 54:

Intuition

• Frequent words have d-gaps significantly smaller than rare words
• Variable-length representations are therefore used, in which small values are more likely and are coded more economically than large ones
• Various methods exist, based on the probability distribution of the index numbers

Page 55:

Bitmaps

• For every term in the lexicon, a bitvector is stored, with each bit corresponding to a document: set to 1 if the term appears in that document, 0 otherwise
• Thus if the term "pot" appears in documents 2 and 5, the corresponding bits are set
  – pot -> 0100100
• Bitmaps are efficient for Boolean queries
  – Answers to a query are obtained by simply combining bitvectors with Boolean operators

Page 56:

Example of bitmaps

term      bitvector
cold      100100
days      001001
hot       100100
in        010010
it        010010
like      000110
nine      001001
old       001001
pease     110000
porridge  110000
pot       010010
some      000110
the       010010

Query: "some" AND "pot"
Answer: 000110 AND 010010 yields 000010, so d5 is the unique answer.
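Using Python integers as bitvectors (an implementation convenience, not from the slide), the query above becomes a single AND:

```python
def bitmap(doc_ids, n_docs):
    """Bit i (from the left) is 1 if the term occurs in document i+1."""
    bits = "".join("1" if d in doc_ids else "0" for d in range(1, n_docs + 1))
    return int(bits, 2)

some = bitmap({4, 5}, 6)   # 000110
pot  = bitmap({2, 5}, 6)   # 010010
hits = some & pot          # Boolean AND of the two bitvectors
print(format(hits, "06b")) # 000010 -> only document 5
```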

Page 57:

What Are Signature Files?

In your email programme you will find the option "signatures" (for example, in Outlook Express, go to "tools", "options" and you will see "signatures"). Such signature files let a person or business state contact information at the end of an e-mail message; they are also known as "sig. files", and using them is a free opportunity to advertise.

Signature Files

Page 58:

Signature Files

• Signature files were popular in the past because they are implicitly compressed (and take less space than uncompressed inverted files)

• A signature file is a probabilistic method for indexing text: each document has an associated signature or descriptor, a string of bits that captures the content of the document

Page 59:

Not a new idea…

Page 60:

Word signatures

• Assume that words have 16-bit descriptors, constructed by taking 3 hash functions generating values between 1…16 and setting the corresponding bits in a 16-bit descriptor
• Example
  – if the 3 hash values for cold are 3, 5, 16
  – cold -> 1000 0000 0010 0100
• Very unlikely, but not impossible, that the signatures of 2 words are identical

Page 61:

Example for a given lexicon

term      hash string
cold      1000 0000 0010 0100
days      0010 0100 0000 1000
hot       0000 1010 0000 0000
in        0000 1001 0010 0000
it        0000 1000 1000 0010
like      0100 0010 0000 0001
nine      0010 1000 0000 0100
old       1000 1000 0100 0000
pease     0100 0100 0010 0000
porridge  0100 0100 0010 0000
pot       0000 0010 0110 0000
some      0100 0100 0000 0001
the       1010 1000 0000 0000

Page 62:

Document signatures

• Signatures for each word in a given document are superimposed
  – The word signatures of each word in a document d are ORed to form the document signature
  – Example: document 6, "nine days old"

Page 63:

Example

• nine: 0010 1000 0000 0100
• days: 0010 0100 0000 1000
• old:  1000 1000 0100 0000
• d6 "nine days old": 1010 1100 0100 1100

• To test whether a word is in a document, calculate its signature. If the corresponding bits are set in the document signature, there is a high probability that the document contains the word.
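The superimposed coding and the membership test can be sketched with the slide's 16-bit signatures (re-typed here as integer literals):

```python
def signature(words, word_sigs):
    """OR the word signatures together to get the document signature."""
    sig = 0
    for w in words:
        sig |= word_sigs[w]
    return sig

# 16-bit word signatures from the slide's lexicon
word_sigs = {
    "nine": 0b0010100000000100,
    "days": 0b0010010000001000,
    "old":  0b1000100001000000,
    "cold": 0b1000000000100100,
}
d6 = signature(["nine", "days", "old"], word_sigs)
print(format(d6, "016b"))  # 1010110001001100, as on the slide

# membership test: are all bits of the word signature set in the document's?
maybe_old = (d6 & word_sigs["old"]) == word_sigs["old"]
```

For d6 the test answers True for "old" and False for "cold"; in general a True is only probable evidence, which is why false hits (next slides) must be checked.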

Page 64:

False hits

• Reduce false hits by increasing the number of bits set for each term and increasing the length of the signature
• Yet we still need to fetch the document at query time and scan/parse/stem it to check that the term really occurs: this costs time
• Effective when queries are long (lower probabilities of false hits)

Page 65:

Example of false hits

Doc  Signature              Text
1    1100 1111 0010 0101    Pease porridge hot, pease porridge cold.
2    1110 1111 0110 0001    Pease porridge in the pot.
3    1010 1100 0100 1100    Nine days old.
4    1100 1110 1010 0111    Some like it hot, some like it cold.
5    1110 1111 1110 0011    Some like it in the pot.
6    1010 1100 0100 1100    Nine days old.

cold  1000 0000 0010 0100
old   1000 1000 0100 0000

Signature matches: cold -> docs 1 and 4; old -> docs 2, 3, 5 and 6. Real matches? For "cold" all are real; for "old", docs 2 and 5 are false hits.

Page 66:

Merits of Signature Files

• Faster than full text scanning
  – 1 or 2 orders of magnitude faster
  – But… still linear in collection size
• Modest space overhead
  – 10-15% vs. 50-300% (inversion)
• Insertions can be handled easily
  – append only
  – no reorganization or rewriting

Page 67:

Comparison of indexing methods

• Bitmaps consume an order of magnitude more storage than inverted files and signature files
• Signature files require unnecessary accesses to the main text because of false matches
• Signature files' main advantage: no in-memory lexicon
• Compressed inverted files are the state-of-the-art index structure used by most search engines

Page 68:

Search Engines

Page 69:

Searches Per Day (2000)

[Chart; information from searchenginewatch.com]

Page 70:

Searches Per Day (2001)

[Chart; information from searchenginewatch.com]

Page 71:

Page 72:

Challenges for Web Searching

• Distributed data
• Volatile data / "freshness": 40% of the web changes every month
• Exponential growth
• Unstructured and redundant data: 30% of web pages are near duplicates
• Unedited data
• Multiple formats
• Commercial biases
• Hidden data

Page 73:

[Slide from the description of the FAST search engine, by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm]

Page 74:

Standard Web Search Engine Architecture

[Diagram: crawl the web; check for duplicates and store the documents; create an inverted index; search engine servers answer user queries against the inverted index, returning DocIds and showing results to the user.]

Page 75:

More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.

Page 76:

Indexes for Web Search Engines

• Inverted indexes are still used, even though the web is so huge
• Some systems partition the indexes across different machines
  – Each machine handles different parts of the data
• Other systems duplicate the data across many machines
  – Queries are distributed among the machines
• Most do a combination of these

Page 77:

In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.

• Each row can handle 120 queries per second
• Each column can handle 7M pages
• To handle more queries, add another row

From the description of the FAST search engine, by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

Page 78:

Querying: Cascading Allocation of CPUs

• A variation on this that produces cost savings:
  – Put high-quality/common pages on many machines
  – Put lower-quality/less common pages on fewer machines
  – A query goes to the high-quality machines first
  – If no hits are found there, go to the other machines

Page 79:

Google

• Google maintains the world's largest Linux cluster (10,000 servers)

• These are partitioned between index servers and page servers
  – Index servers resolve the queries (massively parallel processing)
  – Page servers deliver the results of the queries
• Over 3 billion web pages are indexed and served by Google

Page 80:

Web Page Ranking

• Varies by search engine
  – Pretty messy in many cases
  – Details usually proprietary and fluctuating
• Combining subsets of:
  – Term frequencies
  – Term proximities
  – Term position (title, top of page, etc.)
  – Term characteristics (boldface, capitalized, etc.)
  – Link analysis information
  – Category information
  – Popularity information
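As a toy illustration of combining such signals into a single score, here is a sketch; all weights, field names, and pages are invented, not taken from any real engine:

```python
import math

def score(page: dict, query_terms: list) -> float:
    """Combine a few ranking signals into one number (weights invented)."""
    tf = sum(page["term_freqs"].get(t, 0) for t in query_terms)      # term frequencies
    in_title = any(t in page["title"].lower() for t in query_terms)  # term position
    return (math.log1p(tf)                 # damped term-frequency contribution
            + (2.0 if in_title else 0.0)   # bonus for a title match
            + 1.5 * page["link_score"])    # link analysis / popularity signal

page_a = {"term_freqs": {"toyota": 5}, "title": "Toyota Official Site", "link_score": 0.9}
page_b = {"term_freqs": {"toyota": 9}, "title": "My car blog", "link_score": 0.1}
ranked = sorted([page_a, page_b], key=lambda p: score(p, ["toyota"]), reverse=True)
```

Note how the title match and link score let page_a outrank page_b despite its lower raw term frequency.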

Page 81:

Ranking: Link Analysis

• Assumptions:
  – If the pages pointing to this page are good, then this is also a good page
  – The words on the links pointing to this page are useful indicators of what this page is about
• References: Page et al. 98, Kleinberg 98
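The first assumption is what PageRank (Page et al. 98) formalizes. A minimal power-iteration sketch over an invented four-page link graph, using damping factor 0.85:

```python
# Toy link graph (invented): every page has at least one outgoing link
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
d = 0.85                                    # damping factor
pr = {p: 1.0 / len(pages) for p in pages}   # start from a uniform distribution

for _ in range(50):                         # power iteration
    pr = {p: (1 - d) / len(pages)
             + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
          for p in pages}
```

Page C, pointed to by three other pages, ends up with the highest rank.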

Page 82:

Ranking: Link Analysis

• Why does this work?
  – The official Toyota site will be linked to by lots of other official (or high-quality) sites
  – The best Toyota fan-club site probably also has many links pointing to it
  – Less high-quality sites do not have as many high-quality sites linking to them

Page 83:

Web Crawling

• How do the web search engines get all of the items they index?
• Main idea:
  – Start with known sites
  – Record information for these sites
  – Follow the links from each site
  – Record information found at new sites
  – Repeat

Page 84:

Web Crawlers

• How do the web search engines get all of the items they index?
• More precisely:
  – Put a set of known sites on a queue
  – Repeat the following until the queue is empty:
    • Take the first page off of the queue
    • If this page has not yet been processed:
      – Record the information found on this page (positions of words, links going out, etc.)
      – Add each link on the current page to the queue
      – Record that this page has been processed
• Depth-first/breadth-first/…
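The loop above, in breadth-first form, over a toy in-memory "web"; the `get_links` callback stands in for fetching and parsing a real page over HTTP:

```python
from collections import deque

def crawl(seed_urls, get_links):
    """Breadth-first crawl sketch: a queue of known pages, a 'processed'
    set to avoid revisiting, and each page's outgoing links appended to
    the queue. get_links(url) stands in for fetching and parsing a page."""
    queue = deque(seed_urls)
    processed = set()
    order = []
    while queue:
        url = queue.popleft()            # take the first page off the queue
        if url in processed:             # already processed? skip it
            continue
        order.append(url)                # record the information on this page
        processed.add(url)               # record that this page was processed
        queue.extend(get_links(url))     # add each link on the page to the queue
    return order

# Toy web: sites are graphs, not trees -- note the cycle b -> a
toy_web = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(crawl(["a"], lambda u: toy_web[u]))  # → ['a', 'b', 'c', 'd']
```

Swapping the deque for a stack (`pop()` instead of `popleft()`) turns the same sketch into a depth-first crawl.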

Page 85:

Sites Are Complex Graphs, Not Just Trees

[Figure: six sites, each with a handful of pages; links run between pages within a site and across sites, forming cycles — a graph, not a tree]

Page 86:

Web Crawling Issues

• Keep-out signs
  – A file called robots.txt tells the crawler which directories are off limits
• Freshness
  – Figure out which pages change often
  – Recrawl these often
• Duplicates, virtual hosts, etc.
  – Convert page contents with a hash function
  – Compare new pages to the hash table
• Lots of problems
  – Server unavailable
  – Incorrect HTML
  – Missing links
  – Infinite loops
• Web crawling is difficult to do robustly!
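Two of these issues have compact sketches in Python's standard library: honouring robots.txt (via `urllib.robotparser`) and hash-based duplicate detection. The robots.txt rules and page strings below are made up for illustration:

```python
import hashlib
from urllib import robotparser

# "Keep-out signs": consult robots.txt rules before fetching a path
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
allowed = rp.can_fetch("mybot", "http://example.com/private/x.html")  # False

# Duplicates: hash each page's contents and compare against a hash table
seen_hashes = set()
def is_duplicate(page_contents: str) -> bool:
    digest = hashlib.sha1(page_contents.encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Exact-content hashing only catches byte-identical copies; near-duplicate detection needs fuzzier schemes such as shingling.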

Page 87:

XML, IR, … databases?

Page 88:

What’s XML?

• W3C Standard since 1998
  – Subset of SGML (ISO Standard Generalized Markup Language)
• Data-description markup language
  – Contrast HTML: a text-rendering markup language
• De facto format for data exchange on the Internet
  – Electronic commerce
  – Business-to-business (B2B) communication

Page 89:

XML Data Model Highlights

• Tagged elements describe the semantics of the data
  – Easier to parse for a machine and for a human
• An element may have attributes
• An element can contain nested sub-elements
• Sub-elements may themselves be tagged elements or character data
• Usually considered as a tree structure … but really a graph!

Page 90:

An XML Document

<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <initPage>1</initPage>
        <endPage>5</endPage>
        <authors>
          <author AuthorPosition="00">Tom Hanks</author>
        </authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>

Page 91:

XPath Example

• List the titles of articles in which the author is “Tom Hanks”
  – //article[.//author="Tom Hanks"]/title
• Find the titles of articles authored by “Tom Hanks” in volume 1
  – //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
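Python's `xml.etree.ElementTree` supports only a small XPath subset, so in this sketch the nested author predicate is expressed as a Python filter; the document fragment is abridged from the earlier example slide:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<sigmodRecord><issue><volume>1</volume>
  <articles><article>
    <title>XML Research Issues</title>
    <authors><author AuthorPosition="00">Tom Hanks</author></authors>
  </article></articles>
</issue></sigmodRecord>""")

# //article[.//author="Tom Hanks"]/title, with the nested predicate
# expressed as a Python filter over each article's descendant authors
titles = [a.findtext("title")
          for a in doc.iterfind(".//article")
          if "Tom Hanks" in [au.text for au in a.iterfind(".//author")]]
print(titles)  # → ['XML Research Issues']
```

A full XPath engine (e.g. in an XML database) evaluates the predicate itself; the filter here is only a stand-in.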

Page 92:

XQuery: Example

List the titles of the articles authored by “Tom Hanks”

Query expression:
  for $b in document("sigmodRecord.xml")//article
  where $b//author = "Tom Hanks"
  return <title>$b/title/text()</title>

Query result:
  <title>XML Research Issues</title>

Page 93:

Data- vs. Document-centric

• Data-centric
  – Highly structured
  – Usage stems from exchange of data from databases
  – Example:
    <site>
      <item ID="I001">
        <name>Chair</name>
        <description>This chair is in good condition ...</description>
      </item>
      <item ID="I002">
        <name>Table</name>
        <description>...</description>
      </item>
      ...
    </site>
• Document-centric
  – Semi-structured
  – Embedded tags
  – Full-text search important
  – Usage stems from exchange of formatted text
  – Example:
    <memo>
      <author>John Doe</author>
      <title>...</title>
      <body>
        This memo is meant for all persons responsible for
        <list bullets="1">
          <item>either <em>customers</em> abroad,</item>
          <item>or <em>suppliers</em> abroad.</item>
        </list>
        ...
      </body>
    </memo>

Page 94:

Classes of XML Documents

• Structured
  – “Un-normalized” relational data
  – Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
• Mixed
  – Structured data embedded in large text fragments
  – Ex: on-line manuals, transcripts, tax forms

Page 95:

Queries on Mixed Data

• Full-text search operators
  – Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words
• Queries on structure & text
  – Ex: Return the <text> element containing both "exemption" & "social security", and the preceding & following <text> elements
• Queries that span (ignore) structure
  – Ex: Return the <bill> that contains “referred to the Committee on Financial Services”

Page 96:

XML Indexing

[Figure: an XML document with elements A–E and text nodes T1–T4, shown both as a tree (A at the root; B containing T1; C containing D (T2, T3) and E (T4)) and as nested text regions]

Page 97:

XML Document Schema

[Figure: the example tree — A at the root; B containing T1; C containing D (T2, T3) and E (T4)]

Node index:
  node  ID  S  E   P
  A     0   0  13  -
  B     1   1  3   1
  C     2   4  12  1
  D     3   5  8   2
  …     …   …  …   …

Word index:
  T1  2
  T2  6
  T3  7
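With these indexes, containment reduces to an interval test: a word at position p lies inside an element whose region is (S, E) iff S < p < E. A sketch using the table's values (E's region is not listed on the slide and is omitted):

```python
# (S, E) regions from the node index; positions from the word index
node_index = {"A": (0, 13), "B": (1, 3), "C": (4, 12), "D": (5, 8)}
word_index = {"T1": 2, "T2": 6, "T3": 7}

def contains(elem: str, term: str) -> bool:
    """True iff the word's position falls strictly inside the element's region."""
    s, e = node_index[elem]
    return s < word_index[term] < e
```

So contains("B", "T1") and contains("D", "T2") hold, while contains("B", "T2") does not.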

Page 98:

Text Region Algebras

[Figure: the example tree (A; B with T1; C with D (T2, T3) and E (T4)) and its nested-region view]

contains(A, nodeindex)

Page 99:

Text Region Algebras

[Figure: the example tree (A; B with T1; C with D (T2, T3) and E (T4)) and its nested-region view]

contained-by(D, nodeindex)

Page 100:

Using Region Algebras

[Figure: the example tree, with the root A highlighted]

roots := contains and not contained-by

Page 101:

Using Region Algebras

[Figure: the example tree, with the leaf nodes highlighted]

leafs := contained-by and not contains

Page 102:

Using Region Algebras

[Figure: the example tree, with the inner nodes highlighted]

inner := diff(diff(nodes, leafs), roots)
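These three set definitions can be played through directly on region data. The (start, end) coordinates below follow the node index from the earlier slide, with an assumed region for E:

```python
# Regions per node; E's (9, 11) is an assumption, not from the slide's table
regions = {"A": (0, 13), "B": (1, 3), "C": (4, 12), "D": (5, 8), "E": (9, 11)}

def contains(x: str, y: str) -> bool:
    """Region x strictly contains region y."""
    return regions[x][0] < regions[y][0] and regions[y][1] < regions[x][1]

nodes = set(regions)
roots = {x for x in nodes                       # contains and not contained-by
         if any(contains(x, y) for y in nodes)
         and not any(contains(y, x) for y in nodes)}
leafs = {x for x in nodes                       # contained-by and not contains
         if any(contains(y, x) for y in nodes)
         and not any(contains(x, y) for y in nodes)}
inner = nodes - leafs - roots                   # diff(diff(nodes, leafs), roots)
```

On this data the sets come out as roots = {A}, leafs = {B, D, E}, inner = {C}, matching the highlighted nodes on the slides.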

Page 103:

XML Region Operators – 1

• Containment
  – contains and contained-by
• Length
  – length (including markup) and text-length (no markup)
• Other possibilities include…
  – Range or distance, enabling references
  – Overlaps
  – (Minimal) bounding region (direct parent)

Page 104:

XML Region Operators – 2

• Transitive closure
  – Containment: select on the word index with known lower and upper bound
• Regeneration of an XML document
  – Containment on the node index
  – Containment on the word index
  – Union of the node and word subsets

Page 105:

Inverted Lists for XML Documents

Element index — entries are (docno, begin:end, level):
  <section>  (1, 1:23, 0)  (1, 8:22, 1)  (1, 14:21, 2)  …
  <title>    (1, 2:7, 1)   (1, 9:13, 2)  (1, 15:20, 3)  …

Text index — entries are (docno, wordno, level):
  “information”  (1, 3, 2)  …
  “retrieval”    (1, 4, 2)  …

Example document (token positions 1–23):
  <section> <title> Information Retrieval Using RDBMS </title>
    <section> <title> Beyond Simple Translation </title>
      <section> <title> Extension of IR Features </title> </section>
    </section>
  </section>

These lists support containment, direct containment, tight containment, and proximity.

Page 106:

Inverted Lists to Relational Tables

Inverted list for <section>:        ELEMENTS table:
  (1, 1:23, 0)                        term     docno  begin  end  level
  (1, 8:22, 1)                        section  1      1      23   0
  (1, 14:21, 2)                       section  1      8      22   1
  …                                   section  1      14     21   2
                                      …

Inverted list for “information”:    TEXTS table:
  (1, 3, 2)                           term         docno  wordno  level
  …                                   information  1      3       2
                                      …

Page 107:

Inverted List Operations to SQL Translation

-- E // 'T'  (containment)
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end

-- E / 'T'  (direct containment)
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end
  and e.level = t.level - 1

-- E = 'T'  (tight containment)
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and t.wordno = e.begin + 1 and e.end = t.wordno + 1

-- distance('T1', 'T2') <= n  (proximity)
select * from TEXTS t1, TEXTS t2
where t1.term = 'T1' and t2.term = 'T2'
  and t1.docno = t2.docno
  and t2.wordno > t1.wordno and t2.wordno <= t1.wordno + n
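These translations can be tried directly, e.g. against SQLite (`begin` and `end` are SQL keywords, hence the quoting; the rows follow the running <section>/<title> example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE elements(term TEXT, docno INT, "begin" INT, "end" INT, level INT);
CREATE TABLE texts(term TEXT, docno INT, wordno INT, level INT);
INSERT INTO elements VALUES ('section',1,1,23,0),('section',1,8,22,1),
                            ('section',1,14,21,2),('title',1,2,7,1);
INSERT INTO texts VALUES ('information',1,3,2),('retrieval',1,4,2);
""")

# title // 'information'  (containment translation from the slide)
rows = con.execute("""
    SELECT e.term, t.term, t.wordno
    FROM elements e, texts t
    WHERE e.term = 'title' AND t.term = 'information'
      AND e.docno = t.docno
      AND e."begin" < t.wordno AND t.wordno < e."end"
""").fetchall()
print(rows)  # → [('title', 'information', 3)]
```

The word "information" at position 3 indeed falls inside the title region 2:7.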

Page 108:

Experimental Setup: Workload

• Data sets:

                                 Shakespeare  DBLP       Synthetic
  Size of documents              8 MB         53 MB      207 MB
  Inverted index size            11 MB        78 MB      285 MB
  Relational table size          15 MB        121 MB     566 MB
  Number of distinct elements    22           598        715
  Number of distinct text words  22,825       250,657    100,000
  Total number of elements       179,726      1,595,010  4,999,500
  Total number of text words     474,057      3,655,148  19,952,000

• 13 micro-benchmark queries:
  – Of the form: ‘elem’ contains ‘word’
  – Vary ‘elem’ and ‘word’ over different frequencies: check sensitivity of performance w.r.t. selectivity

Page 109:

Number of Term Occurrences

  Queries  term1 frequency  term2 frequency  result rows
  QS1      90               277              2
  QS2      107,833          277              36
  QS3      107,833          3,231            1,543
  QS4      107,833          1                1
  QD1      654              55               13
  QD2      4,188            712              672
  QD3      287,513          6,363            6,315
  QD4      287,513          3                3
  QG1      50               1,000            809
  QG2      134,900          55,142           1,470
  QG3      701,000          165,424          21,936
  QG4      50               82,712           12
  QG5      701,000          17               4

Page 110:

DB2 / Inverted List Engine Performance Ratios

• Want to know why the RDBMS:
  – sometimes performs much better
  – usually performs much worse
• Two significant factors: join algorithm and cache utilization

Page 111:

Why Does RDBMS Sometimes Perform Better?

-- E // 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end

[Figure: the RDBMS evaluates this with an index nested-loop join, probing indexes on ELEMENTS (or TEXTS) for 'E' and 'T' — more CPU; the inverted list engine scans the full 'E' and 'T' lists — more I/O]

Page 112:

MPMGJN vs. Standard Merge Join

Inverted list engine: MPMGJN — RDBMS: standard merge join (same inputs):

  ELEMENTS (d, b, e):   TEXTS (d, w):
    5   7  20             5   2
    5  14  19             5  23
    5  21  28             5  24
    5  22  27             5  33
    5  29  31             5  37
    5  32  40             5  42

MPMGJN merges on three predicates: d1 = d2, b <= w, w <= e.
The standard merge join merges on one predicate, d1 = d2, with b <= w and w <= e applied only as additional filtering predicates.
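A toy sketch of the multi-predicate idea (not the actual MPMGJN implementation): both inputs are sorted by (docno, position), and the word cursor only ever moves forward across element regions:

```python
def multi_predicate_merge(elements, words):
    """elements: sorted (d, b, e) regions; words: sorted (d, w) postings.
    Emit pairs satisfying all three predicates d1 == d2, b <= w, w <= e."""
    out, start = [], 0
    for d, b, e in elements:
        # Words before (d, b) can never match this or any later region,
        # since later regions have begin >= b: advance the cursor past them.
        while start < len(words) and words[start] < (d, b):
            start += 1
        i = start
        while i < len(words) and words[i][0] == d and words[i][1] <= e:
            out.append(((d, b, e), words[i]))  # b <= w holds by the skip above
            i += 1
    return out

elems = [(1, 0, 10), (1, 2, 5), (2, 3, 8)]   # toy regions, sorted by (d, b)
ws = [(1, 1), (1, 3), (1, 12), (2, 4)]       # toy postings, sorted by (d, w)
res = multi_predicate_merge(elems, ws)
```

Merging on all three predicates means each word is only compared against regions whose intervals can still contain it, rather than against every word in the same document.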

Page 113:

Number of Row Pairs Compared

  Queries  d1=d2, b<=w<=e  d1=d2
  QS1      5               1,653
  QS2      7,131           984,948
  QS3      89,716          10,175,904
  QS4      2,366           3,475
  QD1      503             555
  QD2      4,723           1,315,662
  QD3      263,458         14,082,080
  QD4      1,766           4,950
  QG1      1,000           1,000
  QG2      103,994         148,773,116
  QG3      610,816         2,319,244,480
  QG4      12              82,712
  QG5      56,084          238,340

Page 114:

MPMGJN vs. Index Nested-Loop Join

[Figure: the same ELEMENTS (d, b, e) and TEXTS (d, w) inputs; the inverted list engine merges them with MPMGJN, while the RDBMS probes an index on the inner table for each outer row (index nested-loop join)]

Which of MPMGJN and the index nested-loop join wins depends on:
  – the number of index key comparisons vs. record comparisons
  – the cost of index key comparisons vs. record comparisons

Page 115:

RDBMS vs. Special-purpose?

• Recap:
  – RDBMSs have great technology (e.g., B+-trees, query optimization, storage management)
  – but are not well-tuned for containment queries using inverted lists
• Open question:
  – use an RDBMS or a special-purpose engine to process containment queries using inverted lists?

Page 116:

Thanks

• Ron Larson• Marti Hearst• Doug Oard• Philip Resnik• Sriram Raghavan• Johan List• Maurice van Keulen• David Carmel• Aya Soffer• …