Implementation Issues — Arjen P. de Vries, Centrum voor Wiskunde en Informatica, Amsterdam — [email protected]
TRANSCRIPT
Implementation Issues
Arjen P. de Vries, Centrum voor Wiskunde en Informatica
Overview
• Indexing collections
• Some text statistics
• File organization
  – Inverted files
  – Signature files
• Search engines
• Indexing XML
Supporting the Search Process
[Diagram: the search process. Source Selection leads to Query Formulation, producing a Query; the IR System then performs Search (producing a Ranked List), Selection (producing a Document), Examination, and Delivery. Feedback loops: Query Reformulation and Relevance Feedback return to Query Formulation; Source Reselection returns to Source Selection. The system's roles are labelled Nominate, Choose, and Predict.]
Supporting the Search Process
[Diagram: the same search process, now also showing the offline side: an Acquisition step builds the Collection, and an Indexing step builds the Index that the IR System searches.]
Some Questions for Today
• How long will it take to find a document?
  – Is there any work we can do in advance?
    • If so, how long will that take?
• How big a computer will I need?
  – How much disk space? How much RAM?
• What if more documents arrive?
  – How much of the advance work must be repeated?
  – Will searching become slower?
  – How much more disk space will be needed?
Text Statistics

Frequent words on the WWW (document frequency, word):
65,002,930 the; 62,789,720 a; 60,857,930 to; 57,248,022 of; 54,078,359 and; 52,928,506 in; 50,686,940 s; 49,986,064 for; 45,999,001 on; 42,205,245 this; 41,203,451 is; 39,779,377 by; 35,439,894 with; 35,284,151 or; 34,446,866 at; 33,528,897 all; 31,583,607 are; 30,998,255 from; 30,755,410 e; 30,080,013 you; 29,669,506 be; 29,417,504 that; 28,542,378 not; 28,162,417 an; 28,110,383 as; 28,076,530 home; 27,650,474 it; 27,572,533 i; 24,548,796 have; 24,420,453 if; 24,376,758 new; 24,171,603 t; 23,951,805 your; 23,875,218 page; 22,292,805 about; 22,265,579 com; 22,107,392 information; 21,647,927 will; 21,368,265 can; 21,367,950 more; 21,102,223 has; 20,621,335 no; 19,898,015 other; 19,689,603 one; 19,613,061 c; 19,394,862 d; 19,279,458 m; 19,199,145 was; 19,075,253 copyright; 18,636,563 us
(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)
Plotting Word Frequency by Rank
• Main idea: count how many times each token occurs, summed over all the texts in the collection.
• Order these tokens by how often they occur (highest to lowest); a token's position in this ordering is called its rank.
• The product of a word's frequency (f) and its rank (r) is approximately constant.
• Another way to state this is as an approximately correct rule of thumb:
  – Say the most common term occurs C times.
  – The second most common occurs C/2 times.
  – The third most common occurs C/3 times.
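The rule of thumb above is easy to check empirically: rank the tokens by count and look at f · r for each. A minimal sketch (the toy corpus below is hypothetical and chosen so the counts follow the C, C/2, C/3 pattern exactly; on real text the products are only roughly constant):

```python
from collections import Counter

def zipf_check(text):
    """Rank words by frequency and report f * r, which Zipf's law
    predicts to be roughly constant."""
    freqs = Counter(text.lower().split())
    ranked = freqs.most_common()        # highest frequency first
    return [(rank, word, f, f * rank)
            for rank, (word, f) in enumerate(ranked, start=1)]

# Toy corpus with counts C, C/2, C/3 for C = 60.
corpus = " ".join(["the"] * 60 + ["of"] * 30 + ["and"] * 20)
for rank, word, f, product in zipf_check(corpus):
    print(rank, word, f, product)   # the product is 60 on every row
```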
Zipf Distribution

  f ≅ C · r⁻¹, with C ≅ N/10

(f: word frequency, r: rank, N: total number of word occurrences.)
[Illustration by Jacob Nielsen]
Zipf Distribution
[Figure: the Zipf curve plotted on a linear scale and on a logarithmic scale; on log-log axes it is approximately a straight line.]
What has a Zipf Distribution?
• Words in a text collection
  – Virtually any use of natural language
• Library book checkout patterns
• Incoming web page requests (Nielsen)
• Outgoing web page requests (Cunha & Crovella)
• Document size on the web (Cunha & Crovella)
Related Distributions/Laws
• For more examples, see Robert A. Fairthorne, "Empirical Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction," Journal of Documentation, 25(4), 319–341.
• Pareto distribution of wealth; Willis taxonomic distribution in biology; Mandelbrot on self-similarity, market prices, and communication errors; etc.
Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called "stop words" in IR
  – Usually correspond to the linguistic notion of "closed-class" words
    • English examples: to, from, on, and, the, ...
    • Grammatical classes that don't take on new members
• There are always a large number of tokens that occur only once (and these can have unexpected consequences for some IR algorithms).
• Medium-frequency words are the most descriptive.
Word Frequency vs. Resolving Power
[Figure: the most frequent words are not the most descriptive; resolving power peaks at medium frequencies. (From van Rijsbergen, 1979.)]
File Organization

File Structures for IR
• Lexicographical indices (sorted indices)
  – Inverted files
  – Patricia (PAT) trees (suffix trees and arrays)
• Indices based on hashing
  – Signature files
Document Vectors

ID  nova  galaxy  heat  h'wood  film  role  diet  fur
A    10     5      3
B     5    10
C                        10      8     7
D                         9     10     5
E                                     10    10
F                                9          10
G     5     7      9
H           6            10            2     8
I                         7      5           1    3

"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A. (A blank means 0 occurrences.)
"Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I.
Inverted Index
• The primary data structure for text indexes.
• Main idea: invert documents into a big index.
• Basic steps:
  – Make a "dictionary" of all the tokens in the collection.
  – For each token, list all the docs it occurs in.
  – Do a few things to reduce redundancy in the data structure.
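The basic steps can be sketched in a few lines. This is a minimal illustration (the whitespace tokenizer and in-memory dictionaries are simplifications; a real indexer would also handle punctuation, stemming, and on-disk storage):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, tf), ...]} with
    postings sorted by doc_id -- the dictionary and the postings
    lists held together in one Python dict."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():   # trivial tokenizer
            counts[token][doc_id] += 1
    return {term: sorted(per_doc.items()) for term, per_doc in counts.items()}

docs = {1: "now is the time for all good men",
        2: "it was a dark and stormy night"}
index = build_inverted_index(docs)
print(index["the"])    # [(1, 1)]
print(index["dark"])   # [(2, 1)]
```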
Inverted Indexes
• An inverted file is a vector file "inverted" so that rows become columns and columns become rows:

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1      1   1   0   1   1   1   0
t2      0   0   1   0   1   1   1
t3      1   0   1   0   1   0   0
How Inverted Files Are Created
• Documents are parsed to extract tokens.
• Each token is saved with its document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term/doc# pairs, in order of appearance:
now 1, is 1, the 1, time 1, for 1, all 1, good 1, men 1, to 1, come 1, to 1, the 1, aid 1, of 1, their 1, country 1, it 2, was 2, a 2, dark 2, and 2, stormy 2, night 2, in 2, the 2, country 2, manor 2, the 2, time 2, was 2, past 2, midnight 2
What are the tokens?
• Stemming?
• Case folding?
• Thesauri and/or soundex?
• Spelling correction?
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically (by term, then doc#):
a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2
How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled (term doc#:freq):
a 2:1, aid 1:1, all 1:1, and 2:1, come 1:1, country 1:1, country 2:1, dark 2:1, for 1:1, good 1:1, in 2:1, is 1:1, it 2:1, manor 2:1, men 1:1, midnight 2:1, night 2:1, now 1:1, of 1:1, past 2:1, stormy 2:1, the 1:2, the 2:2, their 1:1, time 1:1, time 2:1, to 1:2, was 2:2
How Inverted Files are Created
• The file is then split into a dictionary and a postings file.

Dictionary (term, #docs, total freq):
a 1 1; aid 1 1; all 1 1; and 1 1; come 1 1; country 2 2; dark 1 1; for 1 1; good 1 1; in 1 1; is 1 1; it 1 1; manor 1 1; men 1 1; midnight 1 1; night 1 1; now 1 1; of 1 1; past 1 1; stormy 1 1; the 2 4; their 1 1; time 2 2; to 1 2; was 1 2

Postings file, (doc#, freq) entries in dictionary order:
(2,1) (1,1) (1,1) (2,1) (1,1) (1,1)(2,1) (2,1) (1,1) (1,1) (2,1) (1,1) (2,1) (2,1) (1,1) (2,1) (2,1) (1,1) (1,1) (2,1) (2,1) (1,2)(2,2) (1,1) (1,1)(2,1) (1,2) (2,2)
Searching
Three general steps:
• Vocabulary search
  – Identify the words and patterns in the query
  – Search for them in the vocabulary
• Retrieval of occurrences
  – Retrieve the lists of occurrences of all the words
• Manipulation of occurrences
  – Solve phrases, proximity, or Boolean operations
  – Find the exact word positions when block addressing is used
How Inverted Files are Used
Boolean query: "time" AND "dark" (using the dictionary and postings file built above)
• The dictionary lists 2 docs for "time" → IDs 1 and 2 from the postings file.
• The dictionary lists 1 doc for "dark" → ID 2 from the postings file.
• Therefore, only doc 2 satisfies the query.
Query Optimization
• Consider a query that is an AND of t terms.
• The idea: for each of the t terms, get its term–document incidence list from the postings, then AND the lists together.
• Process the terms in order of increasing frequency: start with the smallest set, then keep cutting it further.
• This is why the document frequency f_t is kept in the dictionary!
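The increasing-frequency order can be sketched directly: sort the postings lists by length before intersecting, so the candidate set only ever shrinks. A minimal illustration (the tiny index below reuses the "time"/"dark" example; doc-frequency ordering is the slide's heuristic, not an extra optimization of mine):

```python
def and_query(terms, index):
    """Intersect postings lists for an AND query, smallest list first.
    index: {term: sorted list of doc ids}."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings or not postings[0]:
        return []                      # a missing term empties the result
    result = set(postings[0])
    for plist in postings[1:]:
        result &= set(plist)           # each AND can only shrink the set
        if not result:
            break                      # stop early once it is empty
    return sorted(result)

index = {"time": [1, 2], "dark": [2], "the": [1, 2]}
print(and_query(["time", "dark"], index))   # [2]
```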
Processing for Ranking Systems
For each query term:
  for each document in its inverted list:
    augment that document's similarity coefficient
For each document:
  finish the calculation of its similarity coefficient
Sort the similarity coefficients
Retrieve and present the documents
Processing Vector Space Queries (I)
Set A ← {}   (A is the set of accumulators)
For each query term t ∈ Q:
  Stem t
  Search the lexicon; record f_t and the address of I_t
  Set w_t ← 1 + log_e(N / f_t)
Processing Vector Space Queries (II)
  Read the inverted file entry I_t
  For each (d, f_{d,t}) pair in I_t:
    If A_d ∉ A then set A_d ← 0 and A ← A + {A_d}
    Set A_d ← A_d + log_e(1 + f_{d,t}) · w_t
For each A_d ∈ A:
  Set A_d ← A_d / W_d   (where W_d is the weight of document d)
A_d is now proportional to the value cosine(Q, D_d)
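The two-part accumulator algorithm above translates almost line for line into code. A sketch under the slide's definitions (the tiny index and the document weights below are made-up illustration data; stemming is omitted):

```python
import math

def rank(query_terms, inverted, doc_weights, n_docs):
    """Accumulator-based ranking following the slide's steps.
    inverted: {t: [(d, f_dt), ...]}; doc_weights: {d: W_d}."""
    acc = {}                                       # the accumulator set A
    for t in query_terms:
        postings = inverted.get(t)
        if not postings:
            continue
        w_t = 1 + math.log(n_docs / len(postings))  # f_t = len(postings)
        for d, f_dt in postings:
            acc[d] = acc.get(d, 0.0) + math.log(1 + f_dt) * w_t
    for d in acc:
        acc[d] /= doc_weights[d]                   # normalise by W_d
    return sorted(acc.items(), key=lambda item: -item[1])

inverted = {"nova": [(1, 10), (2, 5)], "galaxy": [(1, 5)]}
ranked = rank(["nova", "galaxy"], inverted, {1: 12.0, 2: 11.0}, n_docs=9)
print(ranked[0][0])    # document 1 ranks first
```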
How Large is the Dictionary?
[Figure: number of unique terms vs. number of documents, growing sublinearly up to about 2.0e6 documents.]
• Heaps' law: the vocabulary grows as O(n^β), with β ≈ 0.4–0.6.
Term Storage in the Dictionary
• Store the terms in one long string, keeping pointers only to every kth term.
• Then term lengths must be stored as well (1 extra byte each):
  …7systile 9syzygetic 8syzygial 6syzygy 11szaibelyite 8szczecin 9szomo…
[Figure: dictionary layout with columns Freq., Postings ptr., Term ptr.; example frequencies 33, 29, 44, 126, 7.]
• With k = 4: save 9 bytes on 3 term pointers, lose 4 bytes on term lengths.
Searching the Dictionary
• Sorted arrays
  – Store the list of keywords in a sorted array
  – Use standard binary search
  – Advantage: easy to implement
  – Disadvantage: updating the index is expensive
• Hashing structures
• B-trees, tries, Pat(ricia) trees, …
• Combinations of these structures
The Lexicon as a Hash Table
• Store terms in a hash table:
  – Map the M terms onto a set of m integer values 0 <= h(t_i) <= m-1 (m << M).
  – A simple hash function for strings:
      h(t) = ( Σ_{i=1..|t|} t[i] · w_i ) mod m
    (t[i]: the code of the ith character; w_i: a per-position weight)
• Retrieval time: O(|t|)
• Cons:
  – collisions vs. space trade-off
  – cannot handle wildcards
Tries
• A tree structure constructed from strings.
• The name comes from the word "retrieval".
• Used for string search in normal text.
• Each edge represents a character of the string, including the terminator $.
• Leaves contain the keys.
TRIE — Non-compact
Consider 5 character strings: BIG, BIGGER, BILL, GOOD, GONG.
Each character is stored on one edge of the tree.
[Figure: a trie over BIG$, BIGGER$, BILL$, GOOD$, GONG$. The keys share edges for common prefixes: B-I-G-$ reaches the leaf BIG$, while BIGGER$ continues from the shared B-I-G path via G-E-R-$; BILL$ branches off after B-I; GOOD$ and GONG$ share G-O.]
TRIE — Properties
• An internal node can have from 1 to d children, where d is the size of the alphabet.
• A path from the root of T to an internal node i corresponds to an i-character prefix of a string S.
• The height of the tree is the length of the longest string.
• If there are S unique strings, T has S external nodes.
• Looking up a string of length M is O(M).
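A non-compact trie is only a few lines of code when each node is a dict keyed by character. This sketch uses the slide's five strings and the $ terminator convention; the O(M) lookup is just a walk down M+1 edges:

```python
class Trie:
    """Non-compact trie: one edge per character, '$' terminates keys."""
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word + "$":
            node = node.setdefault(ch, {})   # add an edge if missing

    def contains(self, word):
        node = self.root
        for ch in word + "$":                # O(M) for a word of length M
            if ch not in node:
                return False
            node = node[ch]
        return True

trie = Trie()
for w in ["big", "bigger", "bill", "good", "gong"]:
    trie.insert(w)
print(trie.contains("bigger"), trie.contains("bi"))   # True False
```

Note that the $ edge is what distinguishes a stored key from a mere prefix: "bi" is a prefix of two keys but not a key itself.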
Compact Trie
Remove chains leading to leaves.
[Figure: the same trie over BIG$, BIGGER$, BILL$, GOOD$, GONG$, with each unary chain down to a leaf collapsed into a single edge: after the shared prefix B-I, the branch L leads directly to leaf BILL$; after B-I-G, $ gives BIG$ and GER$ gives BIGGER$; under G-O, OD$ gives GOOD$ and NG$ gives GONG$.]
PATRICIA (PAT tree)
Practical Algorithm To Retrieve Information Coded In Alphanumeric — introduced by D.R. Morrison in October 1968.
Collapse all unary nodes.
[Figure: the compact trie with its remaining unary nodes collapsed as well: the first branch distinguishes BI from GO; under BI, G leads to the BIG$/BIGGER$ branch and LL$ to BILL$; under GO, OD$ gives GOOD$ and NG$ gives GONG$.]
PATRICIA — Comments
• The number of nodes is proportional to the number of strings.
• Strings on edges become variable length.
• Use an auxiliary data structure to actually store the strings; the trie itself only stores triples of numbers indicating where in the auxiliary data structure to look.
Partially Specified Queries (*)
• Using n-grams:
  – E.g., decompose "labor" into la-ab-bo-or.
  – Mark the beginning/end of a word with a special character ($): labor → $l-la-ab-bo-or-r$.
• Answers to lab*r are obtained via: $l AND la AND ab AND r$.
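The decomposition and the AND of the required bigrams can be sketched as follows. The small lexicon reuses terms from the table below; note, as the next slide discusses, that the result is a candidate set that may contain false matches:

```python
def term_bigrams(term):
    """Bigrams of a term with begin/end marked by '$'."""
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def query_bigrams(pattern):
    """Bigrams required by a single-'*' pattern: lab*r ->
    {'$l', 'la', 'ab', 'r$'}. The '*' itself contributes nothing."""
    prefix, suffix = pattern.split("*")
    head, tail = "$" + prefix, suffix + "$"
    grams = {head[i:i + 2] for i in range(len(head) - 1)}
    grams |= {tail[i:i + 2] for i in range(len(tail) - 1)}
    return grams

lexicon = ["abhor", "laaber", "labor", "labour", "lavacaber", "slab"]
hits = [t for t in lexicon
        if query_bigrams("lab*r") <= term_bigrams(t)]   # subset test = AND
print(hits)   # ['laaber', 'labor', 'labour', 'lavacaber']
```

Here "laaber" and "lavacaber" are the false matches: they contain all required bigrams, but not consecutively.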
Use of Bigrams

Number  Term
1       Abhor
2       Bear
3       Laaber
4       Labor
5       Laborator
6       Labour
7       Lavacaber
8       Slab

Bigram  Term ids
$a      1
$b      2
$l      3,4,5,6,7
$s      8
aa      3
ab      1,3,4,5,6,7,8
bo      4,5,6
la      3,4,5,6,7,8
or      1,4,5
ou      6
ra      5
ry      5
r$      1,2,3,4,5,6,7
sl      8
Problems with Bigrams
• The query $l AND la AND ab AND r$ might produce false matches:
  – In the example, it matches "laaber" and "lavacaber" (terms from the TREC collection).
  – Why: all the bigrams are present, but not consecutively.
• What about "labrador"?
  – A right answer to the pattern, but not what was intended.
  – Wildcards with n-grams are risky!
• False matches can be filtered out with an additional exact pattern matcher.
Can We Support Wildcards with a TRIE?
• Easy to answer queries with wildcard tails, e.g., lab*, labor*, …
• What about lab*r or *bour?
• Use a rotated lexicon:
  – For "labor", store: labor$, abor$l, bor$la, or$lab, r$labo, $labor.
• To search for lab*r:
  – Rotate until the wildcard is at the end: r$lab*.
  – Then search for strings that have this prefix.
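The rotate-then-prefix-search idea can be sketched directly. A minimal illustration over a sorted in-memory list (a real system would binary-search or use a trie over the rotated forms rather than scan, and would store (term, offset) addresses as in the table below):

```python
def rotations(term):
    """All rotations of term + '$', e.g. labor -> labor$, abor$l, ..."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_rotated_lexicon(terms):
    entries = []
    for t in terms:
        entries.extend((rot, t) for rot in rotations(t))
    entries.sort()                      # sorted, so prefix ranges are runs
    return entries

def wildcard_search(pattern, rot_lexicon):
    """lab*r: rotate so '*' sits at the end (r$lab*) and prefix-scan."""
    prefix, suffix = pattern.split("*")
    key = suffix + "$" + prefix         # r$lab
    return sorted({t for rot, t in rot_lexicon if rot.startswith(key)})

lex = build_rotated_lexicon(["abhor", "labor", "labour", "laaber", "slab"])
print(wildcard_search("lab*r", lex))   # ['labor', 'labour']
```

Unlike the bigram method, this gives exact answers for a single '*': "laaber" is correctly excluded because no rotation starts with r$lab.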
Rotated Lexicon

id  Term
1   Abhor
2   Bear
3   Laaber
4   Labor
5   Laborator
6   Labour
7   Lavacaber
8   Slab

Rotated form   Address (term, offset)
$abhor         (1,0)
$bear          (2,0)
$laaber        (3,0)
$labor         (4,0)
$laborator     (5,0)
$labour        (6,0)
$lavacaber     (7,0)
$slab          (8,0)
aaber$l        (3,2)
abhor$         (1,1)
aber$la        (3,3)
abor$l         (4,2)
aborator$l     (5,2)
abour$l        (6,2)
aber$lavac     (7,6)
ab$sl          (8,3)
r$abho         (1,5)
r$bea          (2,4)
r$laabe        (3,6)
r$labo         (4,5)
r$laborato     (5,9)
r$labou        (6,6)
r$lavacabe     (7,9)
slab$          (8,1)
What Goes in a Postings File?
• Boolean retrieval
  – Just the document number
• Ranked retrieval
  – Document number and term weight (TF*IDF, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
    • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
How Big Is the Postings File?
• Very compact for Boolean retrieval
  – About 10% of the size of the documents, if an aggressive stopword list is used!
• Not much larger for ranked retrieval
  – Perhaps 20%
• Enormous for proximity operators
  – Sometimes larger than the documents!
Index size as a percentage of the collection (each cell: stop words not indexed / all words indexed):

Addressing     Small (1 MB)   Medium (200 MB)   Large (2 GB)
words          45% / 73%      36% / 64%         35% / 63%
documents      19% / 26%      18% / 32%         26% / 47%
64K blocks     27% / 41%      18% / 32%         5% / 9%
256 blocks     18% / 25%      1.7% / 2.4%       0.5% / 0.7%

Notes: word addressing assumes full inversion (all words, exact positions, 4-byte pointers); block addressing uses 2 or 1 byte(s) per pointer independent of the text size; document addressing (documents ~10 KB) uses 1, 2, or 3 bytes per pointer, depending on the text size.
Inverted File Compression
• Key idea: each postings list can be stored as an ascending sequence of integers.
• Example: if t occurs in 8 documents, list the postings in ascending doc id: <8; 3, 5, 20, 21, 23, 76, 77, 78>.
• Storing the gaps instead changes this to <8; 3, 2, 15, 1, 2, 53, 1, 1>.
Inverted File Compression
• More formally, if t appears in n_t docs: <n_t; d_1, d_2, …, d_{n_t}>, where d_k < d_{k+1}.
• The postings can be equivalently stored as an initial position followed by a list of d-gaps: d_{k+1} − d_k.
• Example: from <8; 3, 5, 20, 21, 23, 76, 77, 78> to <8; 3, 2, 15, 1, 2, 53, 1, 1>.
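The gap transformation is lossless and trivial to code; the saving only appears once the gaps are fed to a variable-length code, as the next slides discuss. A sketch (the variable-byte code shown is one common choice, used here for illustration; it is not the only option):

```python
def to_gaps(postings):
    """<3,5,20,...> -> d-gaps: first doc id, then differences."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    """Inverse transform: running sum recovers the doc ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

def vbyte(n):
    """Variable-byte code: 7 data bits per byte, high bit flags the
    last byte -- so gaps below 128 cost a single byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

postings = [3, 5, 20, 21, 23, 76, 77, 78]
gaps = to_gaps(postings)
print(gaps)                               # [3, 2, 15, 1, 2, 53, 1, 1]
print(sum(len(vbyte(g)) for g in gaps))   # 8 bytes for the whole list
```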
Inverted File Compression
• No information is lost, but … where does the saving come from?
  – The largest d-gap in the second representation is potentially the same as the largest document id in the first.
  – If there are N documents and a fixed binary encoding is used, both methods require log N bits per stored pointer.
Intuition
• Frequent words have d-gaps significantly smaller than rare words.
• So variable-length representations are used, in which small values are more likely and are coded more economically than large ones.
• There are various methods, based on the probability distribution of the gap values.
Bitmaps
• For every term in the lexicon, a bitvector is stored, with each bit corresponding to a document: set to 1 if the term appears in the document, 0 otherwise.
• Thus if the term "pot" appears in documents 2 and 5, the corresponding bits are set: pot → 0100100.
• Bitmaps are efficient for Boolean queries: answers are obtained by simply combining bitvectors with Boolean operators.
Example of Bitmaps

term      bitvector
cold      100100
days      001001
hot       100100
in        010010
it        010010
like      000110
nine      001001
old       001001
pease     110000
porridge  110000
pot       010010
some      000110
the       010010

Query: "some" AND "pot"
Answer: 000110 AND 010010 = 000010 → d5 is the unique answer.
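Because each bitvector fits in a machine word here, the whole query is a single AND instruction. A sketch over the nursery-rhyme collection behind the table (doc 1 is leftmost bit, as in the slide):

```python
def build_bitmaps(docs):
    """One integer per term; bit i from the left (doc 1 first) is set
    when the term occurs in document i+1."""
    n = len(docs)
    bitmaps = {}
    for i, text in enumerate(docs):
        for term in text.lower().split():
            bitmaps[term] = bitmaps.get(term, 0) | (1 << (n - 1 - i))
    return bitmaps

docs = [
    "pease porridge hot pease porridge cold",
    "pease porridge in the pot",
    "nine days old",
    "some like it hot some like it cold",
    "some like it in the pot",
    "nine days old",
]
b = build_bitmaps(docs)
hits = b["some"] & b["pot"]          # the Boolean AND is one operation
print(format(hits, "06b"))           # 000010 -> d5 is the only answer
```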
What Are Signature Files?
In your email programme you will find the option "signatures"; for example, in Outlook Express go to "Tools", then "Options", and you will see "Signatures". Such signatures let you, as a person or business, state contact information at the end of an e-mail message. They are also known as "sig. files", and using them is a free opportunity to advertise.
Signature Files
• Signature files were popular in the past because they are implicitly compressed (and take less space than uncompressed inverted files).
• A signature file is a probabilistic method for indexing text: each document has an associated signature, or descriptor, a string of bits that captures the content of the document.
Not a new idea…

Word Signatures
• Assume words have 16-bit descriptors, constructed by taking 3 hash functions generating values between 1…16 and setting the corresponding bits in a 16-bit descriptor.
• Example: if the 3 hash values for "cold" are 3, 6, and 16, then cold → 1000 0000 0010 0100 (bits numbered from the right).
• It is very unlikely, but not impossible, that the signatures of 2 words are identical.
Example for a Given Lexicon

term      hash string
cold      1000 0000 0010 0100
days      0010 0100 0000 1000
hot       0000 1010 0000 0000
in        0000 1001 0010 0000
it        0000 1000 1000 0010
like      0100 0010 0000 0001
nine      0010 1000 0000 0100
old       1000 1000 0100 0000
pease     0100 0100 0010 0000
porridge  0100 0100 0010 0000
pot       0000 0010 0110 0000
some      0100 0100 0000 0001
the       1010 1000 0000 0000
Document Signatures
• The signatures of all words in a document are superimposed: the word signatures are ORed together to form the document signature.
• Example: document 6, "nine days old":
  nine  0010 1000 0000 0100
  days  0010 0100 0000 1000
  old   1000 1000 0100 0000
  d6    1010 1100 0100 1100
• To test whether a word is in a document, calculate the word's signature. If the corresponding bits are set in the document signature, there is a high probability that the document contains the word.
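Superimposed coding is easy to sketch: K hash functions set K bits per word, document signatures OR the word signatures, and the membership test checks that all of a word's bits are set. A minimal illustration (the seeded-MD5 construction here is a stand-in for any family of independent hash functions, not the slide's specific hashes, so the bit patterns differ from the table above):

```python
import hashlib

BITS, K = 16, 3            # 16-bit signatures, 3 hash functions per word

def word_signature(word):
    """Set K of BITS bits, chosen by K (toy) hash functions."""
    sig = 0
    for k in range(K):
        digest = hashlib.md5(f"{k}:{word}".encode()).digest()
        sig |= 1 << (digest[0] % BITS)   # collisions may set < K bits
    return sig

def doc_signature(text):
    """Superimpose (OR) the signatures of all words in the document."""
    sig = 0
    for word in text.lower().split():
        sig |= word_signature(word)
    return sig

def maybe_contains(doc_sig, word):
    """False is always correct; True may be a false hit."""
    w = word_signature(word)
    return doc_sig & w == w

d6 = doc_signature("nine days old")
print(maybe_contains(d6, "nine"))      # always True: its bits were ORed in
print(maybe_contains(d6, "porridge"))  # probably False, but could be a false hit
```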
False Hits
• Reduce false hits by increasing the number of bits set for each term and increasing the length of the signature.
• You still need to fetch the document at query time and scan/parse/stem it to check that the term really occurs: this costs time.
• Signatures are effective when queries are long (lower probability of false hits).
Example of False Hits

Doc  Signature            Text
1    1100 1111 0010 0101  Pease porridge hot, pease porridge cold.
2    1110 1111 0110 0001  Pease porridge in the pot.
3    1010 1100 0100 1100  Nine days old.
4    1100 1110 1010 0111  Some like it hot, some like it cold.
5    1110 1111 1110 0011  Some like it in the pot.
6    1010 1100 0100 1100  Nine days old.

Query signatures:
cold  1000 0000 0010 0100  → signature matches docs 1 and 4 (both real matches)
old   1000 1000 0100 0000  → signature matches docs 2, 3, 5, and 6 — but only 3 and 6 are real matches; 2 and 5 are false hits
Merits of Signature Files
• Faster than full text scanning
  – 1 or 2 orders of magnitude faster
  – But… still linear in the collection size
• Modest space overhead
  – 10–15%, vs. 50–300% for inversion
• Insertions can be handled easily
  – Append only: no reorganization or rewriting
Comparison of Indexing Methods
• Bitmaps consume an order of magnitude more storage than inverted files and signature files.
• Signature files require unnecessary accesses to the main text because of false matches.
• Signature files' main advantage: no in-memory lexicon.
• Compressed inverted files are the state-of-the-art index structure, used by most search engines.
Search Engines
(Information from searchenginewatch.com)
[Figures: searches per day for the major engines, in 2000 and in 2001.]
Challenges for Web Searching
• Distributed data
• Volatile data / "freshness": 40% of the web changes every month
• Exponential growth
• Unstructured and redundant data: 30% of web pages are near-duplicates
• Unedited data
• Multiple formats
• Commercial biases
• Hidden data
From a description of the FAST search engine by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
Standard Web Search Engine Architecture
[Diagram: crawl the web → check for duplicates, store the documents → create an inverted index → search engine servers evaluate the user query against the inverted index (producing DocIds) and show results to the user.]
A more detailed architecture appears in Brin & Page 98; it covers only the preprocessing in detail, not the query serving.
Indexes for Web Search Engines
• Inverted indexes are still used, even though the web is so huge.
• Some systems partition the indexes across different machines; each machine handles a different part of the data.
• Other systems duplicate the data across many machines; queries are distributed among the machines.
• Most do a combination of these.
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
• Each row can handle 120 queries per second.
• Each column can handle 7M pages.
• To handle more queries, add another row.
(From a description of the FAST search engine by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm)
Querying: Cascading Allocation of CPUs
• A variation on this that produces cost savings:
  – Put high-quality/common pages on many machines.
  – Put lower-quality/less common pages on fewer machines.
  – A query goes to the high-quality machines first.
  – If no hits are found there, it goes on to the other machines.
Google
• Google maintains the world's largest Linux cluster (10,000 servers).
• These are partitioned between index servers and page servers.
  – Index servers resolve the queries (massively parallel processing).
  – Page servers deliver the results of the queries.
• Over 3 billion web pages are indexed and served by Google.
Web Page Ranking
• Varies by search engine
  – Pretty messy in many cases
  – Details usually proprietary and fluctuating
• Combines subsets of:
  – Term frequencies
  – Term proximities
  – Term position (title, top of page, etc.)
  – Term characteristics (boldface, capitalized, etc.)
  – Link analysis information
  – Category information
  – Popularity information
Ranking: Link Analysis
• Assumptions:
  – If the pages pointing to this page are good, then this is also a good page.
  – The words on the links pointing to this page are useful indicators of what this page is about.
• References: Page et al. 98, Kleinberg 98
Ranking: Link Analysis
• Why does this work?
  – The official Toyota site will be linked to by lots of other official (or high-quality) sites.
  – The best Toyota fan-club site probably also has many links pointing to it.
  – Less high-quality sites do not have as many high-quality sites linking to them.
Web Crawling
• How do the web search engines get all of the items they index?
• Main idea:
  – Start with known sites
  – Record information for these sites
  – Follow the links from each site
  – Record information found at new sites
  – Repeat
Web Crawlers
• How do the web search engines get all of the items they index?
• More precisely:
  – Put a set of known sites on a queue.
  – Repeat the following until the queue is empty:
    • Take the first page off the queue.
    • If this page has not yet been processed:
      – Record the information found on this page (positions of words, links going out, etc.).
      – Add each link on the current page to the queue.
      – Record that this page has been processed.
• The queue discipline determines the crawl order: depth-first, breadth-first, …
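The queue algorithm above can be sketched in a few lines. To keep the sketch self-contained, `fetch` is a stand-in for downloading and parsing a page; here it just looks up out-links in a hypothetical in-memory web (with a cycle, since sites are graphs, not trees):

```python
from collections import deque

def crawl(seeds, fetch):
    """Breadth-first crawl: FIFO queue of URLs, skip pages already
    processed. fetch(url) returns the page's list of out-links."""
    queue = deque(seeds)
    processed = {}
    while queue:
        url = queue.popleft()
        if url in processed:            # already handled -- skip
            continue
        links = fetch(url)
        processed[url] = links          # "record the information"
        queue.extend(links)             # enqueue every out-link
    return processed

# Hypothetical four-page web containing a cycle (a <-> b).
web = {"a": ["b", "c"], "b": ["a", "c"], "c": [], "d": ["a"]}
pages = crawl(["a"], lambda u: web.get(u, []))
print(sorted(pages))   # ['a', 'b', 'c'] -- 'd' is never linked to
```

Swapping the deque for a stack (`pop()` instead of `popleft()`) turns the same code into a depth-first crawl.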
Sites Are Complex Graphs, Not Just Trees
[Diagram: six sites, each containing several pages; links run both within sites (e.g. a site's Page 1 to its Pages 2 and 3) and across sites, forming cycles rather than a tree.]
Web Crawling Issues
• "Keep out" signs
  – A file called robots.txt tells the crawler which directories are off limits.
• Freshness
  – Figure out which pages change often, and recrawl these often.
• Duplicates, virtual hosts, etc.
  – Convert page contents with a hash function; compare new pages against the hash table.
• Lots of problems
  – Server unavailable; incorrect HTML; missing links; infinite loops
• Web crawling is difficult to do robustly!
XML, IR, … Databases?

What's XML?
• A W3C standard since 1998
  – A subset of SGML (ISO Standard Generalized Markup Language)
• A data-description markup language
  – (in contrast to HTML, a text-rendering markup language)
• The de facto format for data exchange on the Internet
  – Electronic commerce
  – Business-to-business (B2B) communication
XML Data Model Highlights
• Tagged elements describe the semantics of the data
  – Easier to parse, for a machine and for a human
• An element may have attributes
• An element can contain nested sub-elements
• Sub-elements may themselves be tagged elements or character data
• Usually considered a tree structure … but really a graph!
An XML Document

<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <initPage>1</initPage>
        <endPage>5</endPage>
        <authors>
          <author AuthorPosition="00">Tom Hanks</author>
        </authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>
XPath Examples
• List the titles of articles in which the author is "Tom Hanks":
  //article[.//author="Tom Hanks"]/title
• Find the titles of articles authored by "Tom Hanks" in volume 1:
  //issue[volume="1"]/articles/article[.//author="Tom Hanks"]/title
XQuery: Example
List the titles of the articles authored by "Tom Hanks".

Query expression:
  for $b in document("sigmodRecord.xml")//article
  where $b//author = "Tom Hanks"
  return <title>{$b/title/text()}</title>

Query result:
  <title>XML Research Issues</title>
Data- vs. Document-centric
• Data-centric
  – Highly structured
  – Usage stems from the exchange of data from databases
  – Example:
    <site>
      <item ID="I001">
        <name>Chair</name>
        <description>This chair is in good condition ...</description>
      </item>
      <item ID="I002">
        <name>Table</name>
        <description>...</description>
      </item>
      ...
    </site>
• Document-centric
  – Semi-structured
  – Embedded tags
  – Fulltext search is important
  – Usage stems from the exchange of formatted text
  – Example:
    <memo>
      <author>John Doe</author>
      <title>...</title>
      <body>
        This memo is meant for all persons responsible for
        <list bullets="1">
          <item>either <em>customers</em> abroad,
          <item>or <em>suppliers</em> abroad.
        </list>
        ...
    </memo>
Classes of XML Documents
• Structured
  – "Un-normalized" relational data
  – Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
• Mixed
  – Structured data embedded in large text fragments
  – Ex: on-line manuals, transcripts, tax forms
Queries on Mixed Data
• Full-text search operators
  – Ex: find all <bill>s where "striking" & "amended" are within 6 intervening words
• Queries on structure & text
  – Ex: return the <text> element containing both "exemption" & "social security", and the preceding & following <text> elements
• Queries that span (ignore) structure
  – Ex: return the <bill> that contains "referred to the Committee on Financial Services"
XML Indexing
[Figure: an example XML document tree with elements A–E and text nodes T1–T4, shown next to its schema.]
The document is indexed with a node index and a word index:

Node index (node, ID, Start, End, Parent):
A  0  0  13  –
B  1  1   3  1
C  2  4  12  1
D  3  5   8  2
…  …  …   …  …

Word index (text node, position):
T1  2
T2  6
T3  7
Text Region Algebras
[Figures: the same document tree and node index, illustrating the two basic containment selections.]
• contains(A, nodeindex): selects the node-index regions contained in region A.
• contained-by(D, nodeindex): selects the node-index regions that contain region D.
Using Region Algebras
[Figures: applying the containment operators to the example trees to classify nodes.]
• roots := contains and not contained-by
• leafs := contained-by and not contains
• inner := diff(diff(nodes, leafs), roots)
XML Region Operators — 1
• Containment
  – contains and contained-by
• Length
  – length (including markup) and text-length (no markup)
• Other possibilities include…
  – Range or distance, enabling references
  – Overlaps
  – (Minimal) bounding region (direct parent)
XML Region Operators — 2
• Transitive closure
  – Containment: select on the word index with a known lower and upper bound
• Regeneration of the XML document
  – Containment on the node index
  – Containment on the word index
  – Union of the node and word subsets
Inverted Lists for XML Documents

Example document, with word positions 1–23 assigned to tags and words:

<section>                                        ← element (1, 1:23, 0)
  <title> Information Retrieval Using RDBMS </title>   ← element (1, 2:7, 1)
  <section>                                      ← element (1, 8:22, 1)
    <title> Beyond Simple Translation </title>   ← element (1, 9:13, 2)
    <section>                                    ← element (1, 14:21, 2)
      <title> Extension of IR Features </title>  ← element (1, 15:20, 3)
    </section>
  </section>
</section>

Element index: <section> → (1, 1:23, 0), (1, 8:22, 1), (1, 14:21, 2), …
               <title>   → (1, 2:7, 1), (1, 9:13, 2), (1, 15:20, 3), …
Text index:    "information" → (1, 3, 2), …
               "retrieval"   → (1, 4, 2), …

Entries are (docno, position(s), level); this representation supports containment, direct containment, tight containment, and proximity.
Inverted Lists to Relational Tables

The two inverted lists map directly onto two tables:

ELEMENTS(term, docno, begin, end, level):
section  1   1  23  0
section  1   8  22  1
section  1  14  21  2
…

TEXTS(term, docno, wordno, level):
information  1  3  2
…
Inverted List Operations to SQL Translation

-- E // 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end

-- E / 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end
  and e.level = t.level - 1

-- E = 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and t.wordno = e.begin + 1 and e.end = t.wordno + 1

-- distance("T1", "T2") <= n
select * from TEXTS t1, TEXTS t2
where t1.term = 'T1' and t2.term = 'T2'
  and t1.docno = t2.docno
  and t2.wordno > t1.wordno and t2.wordno <= t1.wordno + n
Data sets:

                               Shakespeare  DBLP       Synthetic
Size of documents              8 MB         53 MB      207 MB
Inverted index size            11 MB        78 MB      285 MB
Relational table size          15 MB        121 MB     566 MB
Number of distinct elements    22           598        715
Number of distinct text words  22,825       250,657    100,000
Total number of elements       179,726      1,595,010  4,999,500
Total number of text words     474,057      3,655,148  19,952,000
Experimental Setup: Workload
• Data sets: Shakespeare, DBLP, and a synthetic collection
• 13 micro-benchmark queries:
  – Of the form: 'elem' contains 'word'
  – 'elem' and 'word' are varied over different frequencies, to check the sensitivity of performance with respect to selectivity
Number of Term Occurrences

Query  term1 freq  term2 freq  result rows
QS1    90          277         2
QS2    107,833     277         36
QS3    107,833     3,231       1,543
QS4    107,833     1           1
QD1    654         55          13
QD2    4,188       712         672
QD3    287,513     6,363       6,315
QD4    287,513     3           3
QG1    50          1,000       809
QG2    134,900     55,142      1,470
QG3    701,000     165,424     21,936
QG4    50          82,712      12
QG5    701,000     17          4
DB2 / Inverted List Engine Performance Ratios
• We want to know why the RDBMS:
  – sometimes performs much better
  – usually performs much worse
• Two significant factors: the join algorithm, and cache utilization.
Why Does the RDBMS Sometimes Perform Better?

-- E // 'T'
select * from ELEMENTS e, TEXTS t
where e.term = 'E' and t.term = 'T'
  and e.docno = t.docno
  and e.begin < t.wordno and t.wordno < e.end

[Diagram: the RDBMS evaluates this with an index nested-loop join, probing indexes on ELEMENTS and TEXTS (more CPU); the inverted list engine instead merges the full 'E' and 'T' lists (more I/O).]
MPMGJN vs. Standard Merge Join
[Diagram: joining an ELEMENTS list (docno, begin, end) with a TEXTS list (docno, wordno), both sorted.]
• MPMGJN (the inverted list engine) merges with three predicates: d1 = d2, b <= w, w <= e.
• A standard merge join (the RDBMS) merges on a single predicate, d1 = d2, and applies b <= w and w <= e only as additional filtering predicates afterwards.
Number of Row Pairs Compared

Query  d1=d2, b<=w<=e  d1=d2
QS1    5               1,653
QS2    7,131           984,948
QS3    89,716          10,175,904
QS4    2,366           3,475
QD1    503             555
QD2    4,723           1,315,662
QD3    263,458         14,082,080
QD4    1,766           4,950
QG1    1,000           1,000
QG2    103,994         148,773,116
QG3    610,816         2,319,244,480
QG4    12              82,712
QG5    56,084          238,340
MPMGJN vs. Index Nested-Loop Join
[Diagram: the same join, evaluated by MPMGJN in the inverted list engine and by an index nested-loop join (probing an index on the TEXTS table) in the RDBMS.]
Which wins depends on:
• the number of index key comparisons vs. record comparisons
• the cost of index key comparisons vs. record comparisons
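The effect of merging on all three predicates can be sketched with a simplified multi-predicate merge join. This is a sketch of the idea, not the exact published algorithm: both inputs are sorted, and a shared cursor into the text list advances monotonically instead of re-scanning the whole list per element (the sample lists below are made-up illustration data):

```python
def mpmgjn(elements, texts):
    """Simplified multi-predicate merge join. elements: (docno, begin,
    end) sorted by (docno, begin); texts: (docno, wordno) sorted.
    Emits pairs satisfying d1 = d2 and begin <= wordno <= end."""
    out = []
    j = 0
    for d, b, e in elements:
        while j < len(texts) and texts[j] < (d, b):
            j += 1                     # words before this region: skip for good
        k = j                          # nested regions re-scan only from j
        while k < len(texts) and texts[k][0] == d and texts[k][1] <= e:
            out.append(((d, b, e), texts[k]))
            k += 1
    return out

elements = [(5, 7, 20), (5, 21, 28), (5, 22, 27)]   # (5,22,27) is nested
texts = [(5, 2), (5, 23), (5, 24), (5, 33)]
matches = mpmgjn(elements, texts)
for pair in matches:
    print(pair)
```

Because the containment predicates steer the cursor, only candidate row pairs are ever touched, which is exactly the "d1=d2, b<=w<=e" column of the table above being so much smaller than the "d1=d2" column.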
RDBMS vs. Special-purpose?
• Recap:
  – RDBMSs have great technology (e.g., B+-trees, query optimization, storage management),
  – but they are not well-tuned for containment queries using inverted lists.
• Open question: use an RDBMS or a special-purpose engine to process containment queries using inverted lists?
Thanks
• Ron Larson
• Marti Hearst
• Doug Oard
• Philip Resnik
• Sriram Raghavan
• Johan List
• Maurice van Keulen
• David Carmel
• Aya Soffer
• …