Implementation Issues — Arjen P. de Vries, Centrum voor Wiskunde en Informatica, Amsterdam — [email protected]
TRANSCRIPT
Implementation Issues
Arjen P. de Vries, Centrum voor Wiskunde en Informatica
Overview
• Indexing collections
• Some text statistics
• File organization
  – Inverted files
  – Signature files
• Search engines
• Indexing XML
Supporting the Search Process
[Diagram: the search process. Source Selection leads to Query Formulation, producing a Query; the IR System then performs Search (producing a Ranked List), Selection (producing a Document), Examination, and Delivery. Feedback loops: Query Reformulation and Relevance Feedback return to Query Formulation; Source Reselection returns to Source Selection. The system's roles are labelled Nominate, Choose, and Predict.]
Supporting the Search Process
[Diagram: the same search process, now also showing the offline side: an Acquisition step builds the Collection, and an Indexing step builds the Index that the IR System searches.]
Some Questions for Today
• How long will it take to find a document?
  – Is there any work we can do in advance?
    • If so, how long will that take?
• How big a computer will I need?
  – How much disk space? How much RAM?
• What if more documents arrive?
  – How much of the advance work must be repeated?
  – Will searching become slower?
  – How much more disk space will be needed?
Text Statistics

Frequent words on the WWW (document frequency, word):
65,002,930 the; 62,789,720 a; 60,857,930 to; 57,248,022 of; 54,078,359 and; 52,928,506 in; 50,686,940 s; 49,986,064 for; 45,999,001 on; 42,205,245 this; 41,203,451 is; 39,779,377 by; 35,439,894 with; 35,284,151 or; 34,446,866 at; 33,528,897 all; 31,583,607 are; 30,998,255 from; 30,755,410 e; 30,080,013 you; 29,669,506 be; 29,417,504 that; 28,542,378 not; 28,162,417 an; 28,110,383 as; 28,076,530 home; 27,650,474 it; 27,572,533 i; 24,548,796 have; 24,420,453 if; 24,376,758 new; 24,171,603 t; 23,951,805 your; 23,875,218 page; 22,292,805 about; 22,265,579 com; 22,107,392 information; 21,647,927 will; 21,368,265 can; 21,367,950 more; 21,102,223 has; 20,621,335 no; 19,898,015 other; 19,689,603 one; 19,613,061 c; 19,394,862 d; 19,279,458 m; 19,199,145 was; 19,075,253 copyright; 18,636,563 us
(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)
Plotting Word Frequency by Rank
• Main idea: count how many times each token occurs, summed over all the texts in the collection.
• Order these tokens by how often they occur (highest to lowest); a token's position in this ordering is called its rank.
• The product of a word's frequency (f) and its rank (r) is approximately constant.
• Another way to state this is as an approximately correct rule of thumb:
  – Say the most common term occurs C times.
  – The second most common occurs C/2 times.
  – The third most common occurs C/3 times.
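The rule of thumb above is easy to check empirically: rank the tokens by count and look at f · r for each. A minimal sketch (the toy corpus below is hypothetical and chosen so the counts follow the C, C/2, C/3 pattern exactly; on real text the products are only roughly constant):

```python
from collections import Counter

def zipf_check(text):
    """Rank words by frequency and report f * r, which Zipf's law
    predicts to be roughly constant."""
    freqs = Counter(text.lower().split())
    ranked = freqs.most_common()        # highest frequency first
    return [(rank, word, f, f * rank)
            for rank, (word, f) in enumerate(ranked, start=1)]

# Toy corpus with counts C, C/2, C/3 for C = 60.
corpus = " ".join(["the"] * 60 + ["of"] * 30 + ["and"] * 20)
for rank, word, f, product in zipf_check(corpus):
    print(rank, word, f, product)   # the product is 60 on every row
```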
Zipf Distribution

  f ≅ C · r⁻¹, with C ≅ N/10

(f: word frequency, r: rank, N: total number of word occurrences.)
[Illustration by Jacob Nielsen]
Zipf Distribution
[Figure: the Zipf curve plotted on a linear scale and on a logarithmic scale; on log-log axes it is approximately a straight line.]
What has a Zipf Distribution?
• Words in a text collection
  – Virtually any use of natural language
• Library book checkout patterns
• Incoming web page requests (Nielsen)
• Outgoing web page requests (Cunha & Crovella)
• Document size on the web (Cunha & Crovella)
Related Distributions/Laws
• For more examples, see Robert A. Fairthorne, "Empirical Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction," Journal of Documentation, 25(4), 319–341.
• Pareto distribution of wealth; Willis taxonomic distribution in biology; Mandelbrot on self-similarity, market prices, and communication errors; etc.
Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called "stop words" in IR
  – Usually correspond to the linguistic notion of "closed-class" words
    • English examples: to, from, on, and, the, ...
    • Grammatical classes that don't take on new members
• There are always a large number of tokens that occur only once (and these can have unexpected consequences for some IR algorithms).
• Medium-frequency words are the most descriptive.
Word Frequency vs. Resolving Power
[Figure: the most frequent words are not the most descriptive; resolving power peaks at medium frequencies. (From van Rijsbergen, 1979.)]
File Organization

File Structures for IR
• Lexicographical indices (sorted indices)
  – Inverted files
  – Patricia (PAT) trees (suffix trees and arrays)
• Indices based on hashing
  – Signature files
Document Vectors

ID  nova  galaxy  heat  h'wood  film  role  diet  fur
A    10     5      3
B     5    10
C                        10      8     7
D                         9     10     5
E                                     10    10
F                                9          10
G     5     7      9
H           6            10            2     8
I                         7      5           1    3

"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A. (A blank means 0 occurrences.)
"Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I.
Inverted Index
• The primary data structure for text indexes.
• Main idea: invert documents into a big index.
• Basic steps:
  – Make a "dictionary" of all the tokens in the collection.
  – For each token, list all the docs it occurs in.
  – Do a few things to reduce redundancy in the data structure.
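The basic steps can be sketched in a few lines. This is a minimal illustration (the whitespace tokenizer and in-memory dictionaries are simplifications; a real indexer would also handle punctuation, stemming, and on-disk storage):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, tf), ...]} with
    postings sorted by doc_id -- the dictionary and the postings
    lists held together in one Python dict."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():   # trivial tokenizer
            counts[token][doc_id] += 1
    return {term: sorted(per_doc.items()) for term, per_doc in counts.items()}

docs = {1: "now is the time for all good men",
        2: "it was a dark and stormy night"}
index = build_inverted_index(docs)
print(index["the"])    # [(1, 1)]
print(index["dark"])   # [(2, 1)]
```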
Inverted Indexes
• An inverted file is a vector file "inverted" so that rows become columns and columns become rows:

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1      1   1   0   1   1   1   0
t2      0   0   1   0   1   1   1
t3      1   0   1   0   1   0   0
How Inverted Files Are Created
• Documents are parsed to extract tokens.
• Each token is saved with its document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term/doc# pairs, in order of appearance:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2
What are the tokens?
• Stemming?
• Case folding?
• Thesauri and/or soundex?
• Spelling correction?
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically (by term, then doc#):
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled (term doc#:freq):
a 2:1, aid 1:1, all 1:1, and 2:1, come 1:1, country 1:1, country 2:1, dark 2:1, for 1:1, good 1:1, in 2:1, is 1:1, it 2:1, manor 2:1, men 1:1, midnight 2:1, night 2:1, now 1:1, of 1:1, past 2:1, stormy 2:1, the 1:2, the 2:2, their 1:1, time 1:1, time 2:1, to 1:2, was 2:2
How Inverted Files are Created
• The file is then split into a dictionary and a postings file.

Dictionary (term, #docs, total freq):
a 1 1; aid 1 1; all 1 1; and 1 1; come 1 1; country 2 2; dark 1 1; for 1 1; good 1 1; in 1 1; is 1 1; it 1 1; manor 1 1; men 1 1; midnight 1 1; night 1 1; now 1 1; of 1 1; past 1 1; stormy 1 1; the 2 4; their 1 1; time 2 2; to 1 2; was 1 2

Postings file, (doc#, freq) entries in dictionary order:
(2,1) (1,1) (1,1) (2,1) (1,1) (1,1)(2,1) (2,1) (1,1) (1,1) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,2)(2,2) (1,1) (1,1)(2,1) (1,2) (2,2)
Searching
Three general steps:
• Vocabulary search
  – Identify the words and patterns in the query
  – Search for them in the vocabulary
• Retrieval of occurrences
  – Retrieve the lists of occurrences of all the words
• Manipulation of occurrences
  – Solve phrases, proximity, or Boolean operations
  – Find the exact word positions when block addressing is used
How Inverted Files are Used
Boolean query: "time" AND "dark" (using the dictionary and postings file built above)
• The dictionary lists 2 docs for "time" → IDs 1 and 2 from the postings file.
• The dictionary lists 1 doc for "dark" → ID 2 from the postings file.
• Therefore, only doc 2 satisfies the query.
Query Optimization
• Consider a query that is an AND of t terms.
• The idea: for each of the t terms, get its term–document incidence list from the postings, then AND the lists together.
• Process the terms in order of increasing frequency: start with the smallest set, then keep cutting it further.
• This is why the document frequency f_t is kept in the dictionary!
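The increasing-frequency order can be sketched directly: sort the postings lists by length before intersecting, so the candidate set only ever shrinks. A minimal illustration (the tiny index below reuses the "time"/"dark" example; doc-frequency ordering is the slide's heuristic, not an extra optimization of mine):

```python
def and_query(terms, index):
    """Intersect postings lists for an AND query, smallest list first.
    index: {term: sorted list of doc ids}."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings or not postings[0]:
        return []                      # a missing term empties the result
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)           # each AND can only shrink the set
        if not result:
            break                      # stop early once it is empty
    return sorted(result)

index = {"time": [1, 2], "dark": [2], "the": [1, 2]}
print(and_query(["time", "dark"], index))   # [2]
```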
Processing for Ranking Systems
For each query term:
  for each document in its inverted list:
    augment that document's similarity coefficient
For each document:
  finish the calculation of its similarity coefficient
Sort the similarity coefficients
Retrieve and present the documents
Processing Vector Space Queries (I)
Set A ← {}   (A is the set of accumulators)
For each query term t ∈ Q:
  Stem t
  Search the lexicon; record f_t and the address of I_t
  Set w_t ← 1 + log_e(N / f_t)
Processing Vector Space Queries (II)
  Read the inverted file entry I_t
  For each (d, f_{d,t}) pair in I_t:
    If A_d ∉ A then set A_d ← 0 and A ← A + {A_d}
    Set A_d ← A_d + log_e(1 + f_{d,t}) · w_t
For each A_d ∈ A:
  Set A_d ← A_d / W_d   (where W_d is the weight of document d)
A_d is now proportional to the value cosine(Q, D_d)
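The two-part accumulator algorithm above translates almost line for line into code. A sketch under the slide's definitions (the tiny index and the document weights below are made-up illustration data; stemming is omitted):

```python
import math

def rank(query_terms, inverted, doc_weights, n_docs):
    """Accumulator-based ranking following the slide's steps.
    inverted: {t: [(d, f_dt), ...]}; doc_weights: {d: W_d}."""
    acc = {}                                       # the accumulator set A
    for t in query_terms:
        postings = inverted.get(t)
        if not postings:
            continue
        w_t = 1 + math.log(n_docs / len(postings))  # f_t = len(postings)
        for d, f_dt in postings:
            acc[d] = acc.get(d, 0.0) + math.log(1 + f_dt) * w_t
    for d in acc:
        acc[d] /= doc_weights[d]                   # normalise by W_d
    return sorted(acc.items(), key=lambda item: -item[1])

inverted = {"nova": [(1, 10), (2, 5)], "galaxy": [(1, 5)]}
ranked = rank(["nova", "galaxy"], inverted, {1: 12.0, 2: 11.0}, n_docs=9)
print(ranked[0][0])    # document 1 ranks first
```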
How Large is the Dictionary?
[Figure: number of unique terms vs. number of documents, growing sublinearly up to about 2.0e6 documents.]
• Heaps' law: the vocabulary grows as O(n^β), with β ≈ 0.4–0.6.
Term Storage in the Dictionary
• Store the terms in one long string, keeping pointers only to every kth term.
• Then term lengths must be stored as well (1 extra byte each):
  …7systile 9syzygetic 8syzygial 6syzygy 11szaibelyite 8szczecin 9szomo…
[Figure: dictionary layout with columns Freq., Postings ptr., Term ptr.; example frequencies 33, 29, 44, 126, 7.]
• With k = 4: save 9 bytes on 3 term pointers, lose 4 bytes on term lengths.
Searching the Dictionary
• Sorted arrays
  – Store the list of keywords in a sorted array
  – Use standard binary search
  – Advantage: easy to implement
  – Disadvantage: updating the index is expensive
• Hashing structures
• B-trees, tries, Pat(ricia) trees, …
• Combinations of these structures
The Lexicon as a Hash Table
• Store terms in a hash table:
  – Map the M terms onto a set of m integer values 0 <= h(t_i) <= m-1 (m << M).
  – A simple hash function for strings:
      h(t) = ( Σ_{i=1..|t|} t[i] · w_i ) mod m
    (t[i]: the code of the ith character; w_i: a per-position weight)
• Retrieval time: O(|t|)
• Cons:
  – collisions vs. space trade-off
  – cannot handle wildcards
Tries
• A tree structure constructed from strings.
• The name comes from the word "retrieval".
• Used for string search in normal text.
• Each edge represents a character of the string, including the terminator $.
• Leaves contain the keys.
TRIE — Non-compact
Consider 5 character strings: BIG, BIGGER, BILL, GOOD, GONG.
Each character is stored on one edge of the tree.
[Figure: a trie over BIG$, BIGGER$, BILL$, GOOD$, GONG$. The keys share edges for common prefixes: B-I-G-$ reaches the leaf BIG$, while BIGGER$ continues from the shared B-I-G path via G-E-R-$; BILL$ branches off after B-I; GOOD$ and GONG$ share G-O.]
TRIE — Properties
• An internal node can have from 1 to d children, where d is the size of the alphabet.
• A path from the root of T to an internal node i corresponds to an i-character prefix of a string S.
• The height of the tree is the length of the longest string.
• If there are S unique strings, T has S external nodes.
• Looking up a string of length M is O(M).
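A non-compact trie is only a few lines of code when each node is a dict keyed by character. This sketch uses the slide's five strings and the $ terminator convention; the O(M) lookup is just a walk down M+1 edges:

```python
class Trie:
    """Non-compact trie: one edge per character, '$' terminates keys."""
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word + "$":
            node = node.setdefault(ch, {})   # add an edge if missing

    def contains(self, word):
        node = self.root
        for ch in word + "$":                # O(M) for a word of length M
            if ch not in node:
                return False
            node = node[ch]
        return True

trie = Trie()
for w in ["big", "bigger", "bill", "good", "gong"]:
    trie.insert(w)
print(trie.contains("bigger"), trie.contains("bi"))   # True False
```

Note that the $ edge is what distinguishes a stored key from a mere prefix: "bi" is a prefix of two keys but not a key itself.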
Compact Trie
Remove chains leading to leaves.
[Figure: the same trie over BIG$, BIGGER$, BILL$, GOOD$, GONG$, with each unary chain down to a leaf collapsed into a single edge: after the shared prefix B-I, the branch L leads directly to leaf BILL$; after B-I-G, $ gives BIG$ and GER$ gives BIGGER$; under G-O, OD$ gives GOOD$ and NG$ gives GONG$.]
PATRICIA (PAT tree)
Practical Algorithm To Retrieve Information Coded In Alphanumeric — introduced by D.R. Morrison in October 1968.
Collapse all unary nodes.
[Figure: the compact trie with its remaining unary nodes collapsed as well: the first branch distinguishes BI from GO; under BI, G leads to the BIG$/BIGGER$ branch and LL$ to BILL$; under GO, OD$ gives GOOD$ and NG$ gives GONG$.]
PATRICIA — Comments
• The number of nodes is proportional to the number of strings.
• Strings on edges become variable length.
• Use an auxiliary data structure to actually store the strings; the trie itself only stores triples of numbers indicating where in the auxiliary data structure to look.
Partially Specified Queries (*)
• Using n-grams:
  – E.g., decompose "labor" into la-ab-bo-or.
  – Mark the beginning/end of a word with a special character ($): labor → $l-la-ab-bo-or-r$.
• Answers to lab*r are obtained via: $l AND la AND ab AND r$.
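The decomposition and the AND of the required bigrams can be sketched as follows. The small lexicon reuses terms from the table below; note, as the next slide discusses, that the result is a candidate set that may contain false matches:

```python
def term_bigrams(term):
    """Bigrams of a term with begin/end marked by '$'."""
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def query_bigrams(pattern):
    """Bigrams required by a single-'*' pattern: lab*r ->
    {'$l', 'la', 'ab', 'r$'}. The '*' itself contributes nothing."""
    prefix, suffix = pattern.split("*")
    head, tail = "$" + prefix, suffix + "$"
    grams = {head[i:i + 2] for i in range(len(head) - 1)}
    grams |= {tail[i:i + 2] for i in range(len(tail) - 1)}
    return grams

lexicon = ["abhor", "laaber", "labor", "labour", "lavacaber", "slab"]
hits = [t for t in lexicon
        if query_bigrams("lab*r") <= term_bigrams(t)]   # subset test = AND
print(hits)   # ['laaber', 'labor', 'labour', 'lavacaber']
```

Here "laaber" and "lavacaber" are the false matches: they contain all required bigrams, but not consecutively.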
Use of Bigrams

Number  Term
1       Abhor
2       Bear
3       Laaber
4       Labor
5       Laborator
6       Labour
7       Lavacaber
8       Slab

Bigram  Term ids
$a      1
$b      2
$l      3,4,5,6,7
$s      8
aa      3
ab      1,3,4,5,6,7,8
bo      4,5,6
la      3,4,5,6,7,8
or      1,4,5
ou      6
ra      5
ry      5
r$      1,2,3,4,5,6,7
sl      8
Problems with Bigrams
• The query $l AND la AND ab AND r$ might produce false matches:
  – In the example, it matches "laaber" and "lavacaber" (terms from the TREC collection).
  – Why: all the bigrams are present, but not consecutively.
• What about "labrador"?
  – A right answer to the pattern, but not what was intended.
  – Wildcards with n-grams are risky!
• False matches can be filtered out with an additional exact pattern matcher.
Can We Support Wildcards with a TRIE?
• Easy to answer queries with wildcard tails, e.g., lab*, labor*, …
• What about lab*r or *bour?
• Use a rotated lexicon:
  – For "labor", store: labor$, abor$l, bor$la, or$lab, r$labo, $labor.
• To search for lab*r:
  – Rotate until the wildcard is at the end: r$lab*.
  – Then search for strings that have this prefix.
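The rotate-then-prefix-search idea can be sketched directly. A minimal illustration over a sorted in-memory list (a real system would binary-search or use a trie over the rotated forms rather than scan, and would store (term, offset) addresses as in the table below):

```python
def rotations(term):
    """All rotations of term + '$', e.g. labor -> labor$, abor$l, ..."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_rotated_lexicon(terms):
    entries = []
    for t in terms:
        entries.extend((rot, t) for rot in rotations(t))
    entries.sort()                      # sorted, so prefix ranges are runs
    return entries

def wildcard_search(pattern, rot_lexicon):
    """lab*r: rotate so '*' sits at the end (r$lab*) and prefix-scan."""
    prefix, suffix = pattern.split("*")
    key = suffix + "$" + prefix         # r$lab
    return sorted({t for rot, t in rot_lexicon if rot.startswith(key)})

lex = build_rotated_lexicon(["abhor", "labor", "labour", "laaber", "slab"])
print(wildcard_search("lab*r", lex))   # ['labor', 'labour']
```

Unlike the bigram method, this gives exact answers for a single '*': "laaber" is correctly excluded because no rotation starts with r$lab.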
Rotated Lexicon

id  Term
1   Abhor
2   Bear
3   Laaber
4   Labor
5   Laborator
6   Labour
7   Lavacaber
8   Slab

Rotated form   Address (term, offset)
$abhor         (1,0)
$bear          (2,0)
$laaber        (3,0)
$labor         (4,0)
$laborator     (5,0)
$labour        (6,0)
$lavacaber     (7,0)
$slab          (8,0)
aaber$l        (3,2)
abhor$         (1,1)
aber$la        (3,3)
abor$l         (4,2)
aborator$l     (5,2)
abour$l        (6,2)
aber$lavac     (7,6)
ab$sl          (8,3)
r$abho         (1,5)
r$bea          (2,4)
r$laabe        (3,6)
r$labo         (4,5)
r$laborato     (5,9)
r$labou        (6,6)
r$lavacabe     (7,9)
slab$          (8,1)
What Goes in a Postings File?
• Boolean retrieval
  – Just the document number
• Ranked retrieval
  – Document number and term weight (TF*IDF, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
    • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
How Big Is the Postings File?
• Very compact for Boolean retrieval
  – About 10% of the size of the documents, if an aggressive stopword list is used!
• Not much larger for ranked retrieval
  – Perhaps 20%
• Enormous for proximity operators
  – Sometimes larger than the documents!
Index size as a percentage of the collection (each cell: stop words not indexed / all words indexed):

Addressing     Small (1 MB)   Medium (200 MB)   Large (2 GB)
words          45% / 73%      36% / 64%         35% / 63%
documents      19% / 26%      18% / 32%         26% / 47%
64K blocks     27% / 41%      18% / 32%         5% / 9%
256 blocks     18% / 25%      1.7% / 2.4%       0.5% / 0.7%

Notes: word addressing assumes full inversion (all words, exact positions, 4-byte pointers); block addressing uses 2 or 1 byte(s) per pointer independent of the text size; document addressing (documents ~10 KB) uses 1, 2, or 3 bytes per pointer, depending on the text size.
Inverted File Compression
• Key idea: each postings list can be stored as an ascending sequence of integers.
• Example: if t occurs in 8 documents, list the postings in ascending doc id: <8; 3, 5, 20, 21, 23, 76, 77, 78>.
• Storing the gaps instead changes this to <8; 3, 2, 15, 1, 2, 53, 1, 1>.
Inverted File Compression
• More formally, if t appears in n_t docs: <n_t; d_1, d_2, …, d_{n_t}>, where d_k < d_{k+1}.
• The postings can be equivalently stored as an initial position followed by a list of d-gaps: d_{k+1} − d_k.
• Example: from <8; 3, 5, 20, 21, 23, 76, 77, 78> to <8; 3, 2, 15, 1, 2, 53, 1, 1>.
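The gap transformation is lossless and trivial to code; the saving only appears once the gaps are fed to a variable-length code, as the next slides discuss. A sketch (the variable-byte code shown is one common choice, used here for illustration; it is not the only option):

```python
def to_gaps(postings):
    """<3,5,20,...> -> d-gaps: first doc id, then differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Inverse transform: running sum recovers the doc ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

def vbyte(n):
    """Variable-byte code: 7 data bits per byte, high bit flags the
    last byte -- so gaps below 128 cost a single byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

postings = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(postings)
print(gaps)                               # [3, 2, 15, 1, 2, 53, 1, 1]
print(sum(len(vbyte(g)) for g in gaps))   # 8 bytes for the whole list
```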
Inverted File Compression
• No information is lost, but … where does the saving come from?
  – The largest d-gap in the second representation is potentially the same as the largest document id in the first.
  – If there are N documents and a fixed binary encoding is used, both methods require log N bits per stored pointer.
Intuition
• Frequent words have d-gaps significantly smaller than rare words.
• So variable-length representations are used, in which small values are more likely and are coded more economically than large ones.
• There are various methods, based on the probability distribution of the gap values.
Bitmaps
• For every term in the lexicon, a bitvector is stored, with each bit corresponding to a document: set to 1 if the term appears in the document, 0 otherwise.
• Thus if the term "pot" appears in documents 2 and 5, the corresponding bits are set: pot → 0100100.
• Bitmaps are efficient for Boolean queries: answers are obtained by simply combining bitvectors with Boolean operators.
Example of Bitmaps

term      bitvector
cold      100100
days      001001
hot       100100
in        010010
it        010010
like      000110
nine      001001
old       001001
pease     110000
porridge  110000
pot       010010
some      000110
the       010010

Query: "some" AND "pot"
Answer: 000110 AND 010010 = 000010 → d5 is the unique answer.
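Because each bitvector fits in a machine word here, the whole query is a single AND instruction. A sketch over the nursery-rhyme collection behind the table (doc 1 is leftmost bit, as in the slide):

```python
def build_bitmaps(docs):
    """One integer per term; bit i from the left (doc 1 first) is set
    when the term occurs in document i+1."""
    n = len(docs)
    bitmaps = {}
    for i, text in enumerate(docs):
        for term in text.lower().split():
            bitmaps[term] = bitmaps.get(term, 0) | (1 << (n - 1 - i))
    return bitmaps

docs = [
    "pease porridge hot pease porridge cold",
    "pease porridge in the pot",
    "nine days old",
    "some like it hot some like it cold",
    "some like it in the pot",
    "nine days old",
]
b = build_bitmaps(docs)
hits = b["some"] & b["pot"]          # the Boolean AND is one operation
print(format(hits, "06b"))           # 000010 -> d5 is the only answer
```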
What Are Signature Files?
In your email programme you will find the option "signatures"; for example, in Outlook Express go to "Tools", then "Options", and you will see "Signatures". Such signatures let you, as a person or business, state contact information at the end of an e-mail message. They are also known as "sig. files", and using them is a free opportunity to advertise.
Signature Files
• Signature files were popular in the past because they are implicitly compressed (and take less space than uncompressed inverted files).
• A signature file is a probabilistic method for indexing text: each document has an associated signature, or descriptor, a string of bits that captures the content of the document.
Not a new idea…

Word Signatures
• Assume words have 16-bit descriptors, constructed by taking 3 hash functions generating values between 1…16 and setting the corresponding bits in a 16-bit descriptor.
• Example: if the 3 hash values for "cold" are 3, 6, and 16, then cold → 1000 0000 0010 0100 (bits numbered from the right).
• It is very unlikely, but not impossible, that the signatures of 2 words are identical.
Example for a Given Lexicon

term      hash string
cold      1000 0000 0010 0100
days      0010 0100 0000 1000
hot       0000 1010 0000 0000
in        0000 1001 0010 0000
it        0000 1000 1000 0010
like      0100 0010 0000 0001
nine      0010 1000 0000 0100
old       1000 1000 0100 0000
pease     0100 0100 0010 0000
porridge  0100 0100 0010 0000
pot       0000 0010 0110 0000
some      0100 0100 0000 0001
the       1010 1000 0000 0000
Document Signatures
• The signatures of all words in a document are superimposed: the word signatures are ORed together to form the document signature.
• Example: document 6, "nine days old":
  nine  0010 1000 0000 0100
  days  0010 0100 0000 1000
  old   1000 1000 0100 0000
  d6    1010 1100 0100 1100
• To test whether a word is in a document, calculate the word's signature. If the corresponding bits are set in the document signature, there is a high probability that the document contains the word.
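Superimposed coding is easy to sketch: K hash functions set K bits per word, document signatures OR the word signatures, and the membership test checks that all of a word's bits are set. A minimal illustration (the seeded-MD5 construction here is a stand-in for any family of independent hash functions, not the slide's specific hashes, so the bit patterns differ from the table above):

```python
import hashlib

BITS, K = 16, 3            # 16-bit signatures, 3 hash functions per word

def word_signature(word):
    """Set K of BITS bits, chosen by K (toy) hash functions."""
    sig = 0
    for k in range(K):
        digest = hashlib.md5(f"{k}:{word}".encode()).digest()
        sig |= 1 << (digest[0] % BITS)   # collisions may set < K bits
    return sig

def doc_signature(text):
    """Superimpose (OR) the signatures of all words in the document."""
    sig = 0
    for word in text.lower().split():
        sig |= word_signature(word)
    return sig

def maybe_contains(doc_sig, word):
    """False is always correct; True may be a false hit."""
    w = word_signature(word)
    return doc_sig & w == w

d6 = doc_signature("nine days old")
print(maybe_contains(d6, "nine"))      # always True: its bits were ORed in
print(maybe_contains(d6, "porridge"))  # probably False, but could be a false hit
```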
False Hits
• Reduce false hits by increasing the number of bits set for each term and increasing the length of the signature.
• You still need to fetch the document at query time and scan/parse/stem it to check that the term really occurs: this costs time.
• Signatures are effective when queries are long (lower probability of false hits).
Example of False Hits

Doc  Signature            Text
1    1100 1111 0010 0101  Pease porridge hot, pease porridge cold.
2    1110 1111 0110 0001  Pease porridge in the pot.
3    1010 1100 0100 1100  Nine days old.
4    1100 1110 1010 0111  Some like it hot, some like it cold.
5    1110 1111 1110 0011  Some like it in the pot.
6    1010 1100 0100 1100  Nine days old.

Query signatures:
cold  1000 0000 0010 0100  → signature matches docs 1 and 4 (both real matches)
old   1000 1000 0100 0000  → signature matches docs 2, 3, 5, and 6 — but only 3 and 6 are real matches; 2 and 5 are false hits
Merits of Signature Files
• Faster than full text scanning
  – 1 or 2 orders of magnitude faster
  – But… still linear in the collection size
• Modest space overhead
  – 10–15%, vs. 50–300% for inversion
• Insertions can be handled easily
  – Append only: no reorganization or rewriting
Comparison of Indexing Methods
• Bitmaps consume an order of magnitude more storage than inverted files and signature files.
• Signature files require unnecessary accesses to the main text because of false matches.
• Signature files' main advantage: no in-memory lexicon.
• Compressed inverted files are the state-of-the-art index structure, used by most search engines.
Search Engines
(Information from searchenginewatch.com)
[Figures: searches per day for the major engines, in 2000 and in 2001.]
Challenges for Web Searching
• Distributed data
• Volatile data / "freshness": 40% of the web changes every month
• Exponential growth
• Unstructured and redundant data: 30% of web pages are near-duplicates
• Unedited data
• Multiple formats
• Commercial biases
• Hidden data
From a description of the FAST search engine by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Standard Web Search Engine Architecture
[Diagram: crawl the web → check for duplicates, store the documents → create an inverted index → search engine servers evaluate the user query against the inverted index (producing DocIds) and show results to the user.]
A more detailed architecture appears in Brin & Page 98; it covers only the preprocessing in detail, not the query serving.
Indexes for Web Search Engines
• Inverted indexes are still used, even though the web is so huge.
• Some systems partition the indexes across different machines; each machine handles a different part of the data.
• Other systems duplicate the data across many machines; queries are distributed among the machines.
• Most do a combination of these.
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
• Each row can handle 120 queries per second.
• Each column can handle 7M pages.
• To handle more queries, add another row.
(From a description of the FAST search engine by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm)
Querying: Cascading Allocation of CPUs
• A variation on this that produces cost savings:
  – Put high-quality/common pages on many machines.
  – Put lower-quality/less common pages on fewer machines.
  – A query goes to the high-quality machines first.
  – If no hits are found there, it goes on to the other machines.
Google
• Google maintains the world's largest Linux cluster (10,000 servers).
• These are partitioned between index servers and page servers.
  – Index servers resolve the queries (massively parallel processing).
  – Page servers deliver the results of the queries.
• Over 3 billion web pages are indexed and served by Google.
Web Page Ranking
• Varies by search engine
  – Pretty messy in many cases
  – Details usually proprietary and fluctuating
• Combines subsets of:
  – Term frequencies
  – Term proximities
  – Term position (title, top of page, etc.)
  – Term characteristics (boldface, capitalized, etc.)
  – Link analysis information
  – Category information
  – Popularity information
Ranking: Link Analysis
• Assumptions:
  – If the pages pointing to this page are good, then this is also a good page.
  – The words on the links pointing to this page are useful indicators of what this page is about.
• References: Page et al. 98, Kleinberg 98
Ranking: Link Analysis
• Why does this work?
  – The official Toyota site will be linked to by lots of other official (or high-quality) sites.
  – The best Toyota fan-club site probably also has many links pointing to it.
  – Less high-quality sites do not have as many high-quality sites linking to them.
Web Crawling
• How do the web search engines get all of the items they index?
• Main idea:
  – Start with known sites
  – Record information for these sites
  – Follow the links from each site
  – Record information found at new sites
  – Repeat
Web Crawlers
• How do the web search engines get all of the items they index?
• More precisely:
  – Put a set of known sites on a queue.
  – Repeat the following until the queue is empty:
    • Take the first page off the queue.
    • If this page has not yet been processed:
      – Record the information found on this page (positions of words, links going out, etc.).
      – Add each link on the current page to the queue.
      – Record that this page has been processed.
• The queue discipline determines the crawl order: depth-first, breadth-first, …
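The queue algorithm above can be sketched in a few lines. To keep the sketch self-contained, `fetch` is a stand-in for downloading and parsing a page; here it just looks up out-links in a hypothetical in-memory web (with a cycle, since sites are graphs, not trees):

```python
from collections import deque

def crawl(seeds, fetch):
    """Breadth-first crawl: FIFO queue of URLs, skip pages already
    processed. fetch(url) returns the page's list of out-links."""
    queue = deque(seeds)
    processed = {}
    while queue:
        url = queue.popleft()
        if url in processed:            # already handled -- skip
            continue
        links = fetch(url)
        processed[url] = links          # "record the information"
        queue.extend(links)             # enqueue every out-link
    return processed

# Hypothetical four-page web containing a cycle (a <-> b).
web = {"a": ["b", "c"], "b": ["a", "c"], "c": [], "d": ["a"]}
pages = crawl(["a"], lambda u: web.get(u, []))
print(sorted(pages))   # ['a', 'b', 'c'] -- 'd' is never linked to
```

Swapping the deque for a stack (`pop()` instead of `popleft()`) turns the same code into a depth-first crawl.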
Sites Are Complex Graphs, Not Just Trees
[Diagram: six sites, each containing several pages; links run both within sites (e.g. a site's Page 1 to its Pages 2 and 3) and across sites, forming cycles rather than a tree.]
Web Crawling Issues
• "Keep out" signs
  – A file called robots.txt tells the crawler which directories are off limits.
• Freshness
  – Figure out which pages change often, and recrawl these often.
• Duplicates, virtual hosts, etc.
  – Convert page contents with a hash function; compare new pages against the hash table.
• Lots of problems
  – Server unavailable; incorrect HTML; missing links; infinite loops
• Web crawling is difficult to do robustly!
XML, IR, … Databases?

What's XML?
• A W3C standard since 1998
  – A subset of SGML (ISO Standard Generalized Markup Language)
• A data-description markup language
  – (in contrast to HTML, a text-rendering markup language)
• The de facto format for data exchange on the Internet
  – Electronic commerce
  – Business-to-business (B2B) communication
XML Data Model Highlights
• Tagged elements describe the semantics of the data
  – Easier to parse, for a machine and for a human
• An element may have attributes
• An element can contain nested sub-elements
• Sub-elements may themselves be tagged elements or character data
• Usually considered a tree structure … but really a graph!
An XML Document

<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <initPage>1</initPage>
        <endPage>5</endPage>
        <authors>
          <author AuthorPosition="00">Tom Hanks</author>
        </authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>
XPath Examples
• List the titles of articles in which the author is "Tom Hanks":
  //article[.//author="Tom Hanks"]/title
• Find the titles of articles authored by "Tom Hanks" in volume 1:
  //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
XQuery: Example
List the titles of the articles authored by "Tom Hanks".

Query expression:
  for $b in document("sigmodRecord.xml")//article
  where $b//author = "Tom Hanks"
  return <title>{$b/title/text()}</title>

Query result:
  <title>XML Research Issues</title>
Data- vs. Document-centric
• Data-centric
  – Highly structured
  – Usage stems from the exchange of data from databases
  – Example:
    <site>
      <item ID="I001">
        <name>Chair</name>
        <description>This chair is in good condition ...</description>
      </item>
      <item ID="I002">
        <name>Table</name>
        <description>...</description>
      </item>
      ...
    </site>
• Document-centric
  – Semi-structured
  – Embedded tags
  – Fulltext search is important
  – Usage stems from the exchange of formatted text
  – Example:
    <memo>
      <author>John Doe</author>
      <title>...</title>
      <body>
        This memo is meant for all persons responsible for
        <list bullets="1">
          <item>either <em>customers</em> abroad,
          <item>or <em>suppliers</em> abroad.
        </list>
        ...
    </memo>
Classes of XML Documents
• Structured
  – "Un-normalized" relational data
  – Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
• Mixed
  – Structured data embedded in large text fragments
  – Ex: on-line manuals, transcripts, tax forms
Queries on Mixed Data
• Full-text search operators
  – Ex: find all <bill>s where "striking" & "amended" are within 6 intervening words
• Queries on structure & text
  – Ex: return the <text> element containing both "exemption" & "social security", and the preceding & following <text> elements
• Queries that span (ignore) structure
  – Ex: return the <bill> that contains "referred to the Committee on Financial Services"
XML Indexing
[Figure: an example XML document tree with elements A–E and text nodes T1–T4, shown next to its schema.]
The document is indexed with a node index and a word index:

Node index (node, ID, Start, End, Parent):
A  0  0  13  –
B  1  1   3  1
C  2  4  12  1
D  3  5   8  2
…  …  …   …  …

Word index (text node, position):
T1  2
T2  6
T3  7
Text Region Algebras
[Figures: the same document tree and node index, illustrating the two basic containment selections.]
• contains(A, nodeindex): selects the node-index regions contained in region A.
• contained-by(D, nodeindex): selects the node-index regions that contain region D.
Using Region Algebras
[Figures: applying the containment operators to the example trees to classify nodes.]
• roots := contains and not contained-by
• leafs := contained-by and not contains
• inner := diff(diff(nodes, leafs), roots)
XML Region Operators — 1
• Containment
  – contains and contained-by
• Length
  – length (including markup) and text-length (no markup)
• Other possibilities include…
  – Range or distance, enabling references
  – Overlaps
  – (Minimal) bounding region (direct parent)
XML Region Operators — 2
• Transitive closure
  – Containment: select on the word index with a known lower and upper bound
• Regeneration of the XML document
  – Containment on the node index
  – Containment on the word index
  – Union of the node and word subsets
Inverted Lists for XML Documents

Example document, with word positions 1–23 assigned to tags and words:

<section>                                        ← element (1, 1:23, 0)
  <title> Information Retrieval Using RDBMS </title>   ← element (1, 2:7, 1)
  <section>                                      ← element (1, 8:22, 1)
    <title> Beyond Simple Translation </title>   ← element (1, 9:13, 2)
    <section>                                    ← element (1, 14:21, 2)
      <title> Extension of IR Features </title>  ← element (1, 15:20, 3)
    </section>
  </section>
</section>

Element index: <section> → (1, 1:23, 0), (1, 8:22, 1), (1, 14:21, 2), …
               <title>   → (1, 2:7, 1), (1, 9:13, 2), (1, 15:20, 3), …
Text index:    "information" → (1, 3, 2), …
               "retrieval"   → (1, 4, 2), …

Entries are (docno, position(s), level); this representation supports containment, direct containment, tight containment, and proximity.
Inverted Lists to Relational Tables

The two inverted lists map directly onto two tables:

ELEMENTS(term, docno, begin, end, level):
section  1   1  23  0
section  1   8  22  1
section  1  14  21  2
…

TEXTS(term, docno, wordno, level):
information  1  3  2
…
Inverted List Operations to SQL Translation

-- E // 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end

-- E / 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end
  and e.level = t.level - 1

-- E = 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and t.wordno = e.begin + 1 and e.end = t.wordno + 1

-- distance("T1", "T2") <= n
select * from TEXTS t1, TEXTS t2
where t1.term = 'T1' and t2.term = 'T2'
  and t1.docno = t2.docno
  and t2.wordno > t1.wordno and t2.wordno <= t1.wordno + n
Data sets:

                               Shakespeare  DBLP       Synthetic
Size of documents              8 MB         53 MB      207 MB
Inverted index size            11 MB        78 MB      285 MB
Relational table size          15 MB        121 MB     566 MB
Number of distinct elements    22           598        715
Number of distinct text words  22,825       250,657    100,000
Total number of elements       179,726      1,595,010  4,999,500
Total number of text words     474,057      3,655,148  19,952,000
Experimental Setup: Workload
• Data sets: Shakespeare, DBLP, and a synthetic collection
• 13 micro-benchmark queries:
  – Of the form: 'elem' contains 'word'
  – 'elem' and 'word' are varied over different frequencies, to check the sensitivity of performance with respect to selectivity
Number of Term Occurrences

Query  term1 freq  term2 freq  result rows
QS1    90          277         2
QS2    107,833     277         36
QS3    107,833     3,231       1,543
QS4    107,833     1           1
QD1    654         55          13
QD2    4,188       712         672
QD3    287,513     6,363       6,315
QD4    287,513     3           3
QG1    50          1,000       809
QG2    134,900     55,142      1,470
QG3    701,000     165,424     21,936
QG4    50          82,712      12
QG5    701,000     17          4
DB2 / Inverted List Engine Performance Ratios
• We want to know why the RDBMS:
  – sometimes performs much better
  – usually performs much worse
• Two significant factors: the join algorithm, and cache utilization.
Why Does the RDBMS Sometimes Perform Better?

-- E // 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end

[Diagram: the RDBMS evaluates this with an index nested-loop join, probing indexes on ELEMENTS and TEXTS (more CPU); the inverted list engine instead merges the full 'E' and 'T' lists (more I/O).]
MPMGJN vs. Standard Merge Join
[Diagram: joining an ELEMENTS list (docno, begin, end) with a TEXTS list (docno, wordno), both sorted.]
• MPMGJN (the inverted list engine) merges with three predicates: d1 = d2, b <= w, w <= e.
• A standard merge join (the RDBMS) merges on a single predicate, d1 = d2, and applies b <= w and w <= e only as additional filtering predicates afterwards.
Number of Row Pairs Compared

Query  d1=d2, b<=w<=e  d1=d2
QS1    5               1,653
QS2    7,131           984,948
QS3    89,716          10,175,904
QS4    2,366           3,475
QD1    503             555
QD2    4,723           1,315,662
QD3    263,458         14,082,080
QD4    1,766           4,950
QG1    1,000           1,000
QG2    103,994         148,773,116
QG3    610,816         2,319,244,480
QG4    12              82,712
QG5    56,084          238,340
MPMGJN vs. Index Nested-Loop Join
[Diagram: the same join, evaluated by MPMGJN in the inverted list engine and by an index nested-loop join (probing an index on the TEXTS table) in the RDBMS.]
Which wins depends on:
• the number of index key comparisons vs. record comparisons
• the cost of index key comparisons vs. record comparisons
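The effect of merging on all three predicates can be sketched with a simplified multi-predicate merge join. This is a sketch of the idea, not the exact published algorithm: both inputs are sorted, and a shared cursor into the text list advances monotonically instead of re-scanning the whole list per element (the sample lists below are made-up illustration data):

```python
def mpmgjn(elements, texts):
    """Simplified multi-predicate merge join. elements: (docno, begin,
    end) sorted by (docno, begin); texts: (docno, wordno) sorted.
    Emits pairs satisfying d1 = d2 and begin <= wordno <= end."""
    out = []
    j = 0
    for d, b, e in elements:
        while j < len(texts) and texts[j] < (d, b):
            j += 1                     # words before this region: skip for good
        k = j                          # nested regions re-scan only from j
        while k < len(texts) and texts[k][0] == d and texts[k][1] <= e:
            out.append(((d, b, e), texts[k]))
            k += 1
    return out

elements = [(5, 7, 20), (5, 21, 28), (5, 22, 27)]   # (5,22,27) is nested
texts = [(5, 2), (5, 23), (5, 24), (5, 33)]
matches = mpmgjn(elements, texts)
for pair in matches:
    print(pair)
```

Because the containment predicates steer the cursor, only candidate row pairs are ever touched, which is exactly the "d1=d2, b<=w<=e" column of the table above being so much smaller than the "d1=d2" column.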
RDBMS vs. Special-purpose?
• Recap:
  – RDBMSs have great technology (e.g., B+-trees, query optimization, storage management),
  – but they are not well-tuned for containment queries using inverted lists.
• Open question: use an RDBMS or a special-purpose engine to process containment queries using inverted lists?
Thanks
• Ron Larson
• Marti Hearst
• Doug Oard
• Philip Resnik
• Sriram Raghavan
• Johan List
• Maurice van Keulen
• David Carmel
• Aya Soffer
• …