DD2476: Lecture 3: Large indexes
DD2476: Lecture 3
• Today: More indexing issues
– Large indexes
– Distributed indexing
– Dynamic indexing
– Index compression
Large indexes
• The web is big:
– 1998: 26 million unique web pages
– 2008: 1 000 000 000 000 (1 trillion) unique web pages!
• What if the index is too large to fit in main memory?
– Dictionary in main memory, postings on disk
– Dictionary (partially) on disk, postings on disk
– Index on several disks (cluster)
[Figure: dictionary terms (friend, roman, countryman) with pointers to their postings lists of docIDs, e.g. 1, 10, 12, 16, 82]
Mixed MM and disk storage
• Tree: Nodes below a certain depth stored on disk
• Hashtable: All postings put on disk, hash keys in MM
• A distributed hash table allows keys and postings to be distributed over a large number of computers
Hardware basics
• Access to data in memory is much faster than access
to data on disk.
• Disk seeks: No data is transferred from disk while the
disk head is being positioned.
– Therefore: Transferring one large chunk of data from disk
to memory is faster than transferring many small chunks.
– Disk I/O is block-based: Reading and writing of entire
blocks (as opposed to smaller chunks).
– Block sizes: 8 KB to 256 KB.
Hardware assumptions
• In the book:
  statistic                                           value
  average seek time                                   5 ms = 5 x 10^-3 s
  transfer time per byte                              0.02 μs = 2 x 10^-8 s
  processor’s clock rate                              10^9 s^-1 (1 GHz)
  low-level operation (e.g., compare & swap a word)   0.01 μs = 10^-8 s
  size of main memory                                 several GB
  size of disk space                                  1 TB or more
Basic indexing
• Term-document pairs are collected when
documents are parsed
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Resulting (term, doc) pairs, in order of occurrence:
(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
(so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)
Sorting step
• The list of term-doc pairs is sorted
• This must be done on disk for large lists
• Goal: Minimize the number of disk seeks

The pair list from the previous slide, sorted first by term and then by docID:
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (was,1) (was,2) (with,2) (you,2)
A bottleneck
• Say we want to sort 100,000,000 term-doc pairs.
• A list of N items can be sorted with about N log2 N comparison operations.
– How much time does that take? (assume 10^-8 s per operation)
• Suppose that each comparison additionally required 2 disk seeks
– How much time now? (assume 5 x 10^-3 s per disk seek)
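A rough answer to both questions, under the stated assumptions:
  N log2 N ≈ 10^8 x 26.6 ≈ 2.7 x 10^9 comparisons
  In memory: 2.7 x 10^9 x 10^-8 s ≈ 27 s
  With 2 disk seeks per comparison: 2.7 x 10^9 x 2 x 5 x 10^-3 s ≈ 2.7 x 10^7 s, i.e. roughly 300 days
So any sort that pays disk seeks per comparison is hopeless; the disk must be used for large sequential reads and writes.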
Scaling index construction
• In-memory index construction does not scale.
• How can we construct an index for very large collections?
– As we build the index, we parse docs one at a time.
– The final postings for any term are incomplete until the end.
– At 12 bytes per non-positional postings entry (term, doc, freq), large collections demand a lot of space.
– Swedish Wikipedia has > 185,000,000 tokens
– Can be done in-memory in 2012, but real collections are much larger
– Thus: We need to store intermediate results on disk
Blocked sort-based indexing
• 12-byte (4+4+4) records (term, doc, freq).
– Define a block as ~10M such records
– Accumulate postings for each block, sort, write to disk.
– Then merge the blocks into one long sorted order.
Sorting 10 blocks of 10M records
• First, read each block and sort within:
– Quicksort takes N ln N expected steps
– In our case (10M) ln (10M) steps
• Exercise: estimate the total time to read each block from disk and quicksort it
– assume a transfer time of 2 x 10^-8 s per byte
• Ten times this estimate gives us 10 sorted runs of 10M records each.
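One way to work the exercise, using the hardware assumptions above:
  Reading one block: 10^7 records x 12 bytes x 2 x 10^-8 s/byte ≈ 2.4 s
  Quicksorting one block: 10^7 x ln(10^7) ≈ 1.6 x 10^8 steps x 10^-8 s ≈ 1.6 s
  Per block ≈ 4 s, so about 40 s for all 10 blocks (seek time is negligible for 10 large sequential reads)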
Blocked sort-based indexing
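As an illustration of the merge step described above, a rough Java sketch. It assumes each block has already been sorted and written as a run file with one record per line, in a format that compares correctly as a string (e.g. fixed-width termID and docID fields); the class, method, and file names are illustrative, not the book's code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.PriorityQueue;

// Merge the sorted runs produced for each block into one sorted file.
// Each run is read sequentially (large reads, few seeks); a priority
// queue always yields the run whose current record is smallest.
public class BsbiMerge {

    static class Run {
        final BufferedReader in;
        String current;                       // current record of this run
        Run(String file) throws IOException {
            in = new BufferedReader(new FileReader(file));
            current = in.readLine();
        }
        void advance() throws IOException { current = in.readLine(); }
    }

    static void merge(List<String> runFiles, String outFile) throws IOException {
        PriorityQueue<Run> queue =
                new PriorityQueue<>((a, b) -> a.current.compareTo(b.current));
        for (String f : runFiles) {
            Run r = new Run(f);
            if (r.current != null) queue.add(r);
        }
        try (PrintWriter out = new PrintWriter(outFile)) {
            while (!queue.isEmpty()) {
                Run r = queue.poll();         // run with the smallest current record
                out.println(r.current);       // append to the final sorted file
                r.advance();
                if (r.current != null) queue.add(r);
            }
        }
        // (A real implementation would also collapse consecutive records
        //  with the same termID into a single postings list.)
    }
}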
From BSBI to SPIMI
• BSBI requires that the dictionary can be kept in main
memory
• Alternative approach: Construct several separate
indexes and merge them
– Generate separate dictionaries for each block – no need to
maintain term-termID mapping across blocks.
– No need to keep dictionary in main memory
– Accumulate postings directly in postings list (as in lab 1).
• This is called SPIMI – Single-Pass In-Memory Index
construction (Figure 4.4)
SPIMI-invert
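A rough Java sketch of the SPIMI-Invert idea for a single block, under the simplifying assumptions that tokens arrive as (term, docID) pairs and that "memory is full" is approximated by a fixed postings budget; the class, constant, and file names are illustrative, not the book's code.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class Spimi {
    static final int MAX_POSTINGS_PER_BLOCK = 10_000_000;   // assumed memory budget

    // Build one block: terms are hashed directly to their postings lists,
    // so no global term-termID mapping is needed across blocks.
    static void invertBlock(Iterator<Map.Entry<String, Integer>> tokens, String outFile)
            throws IOException {
        Map<String, List<Integer>> dictionary = new HashMap<>();
        int postings = 0;
        while (tokens.hasNext() && postings < MAX_POSTINGS_PER_BLOCK) {
            Map.Entry<String, Integer> tok = tokens.next();
            dictionary.computeIfAbsent(tok.getKey(), t -> new ArrayList<>())
                      .add(tok.getValue());                  // accumulate posting directly
            postings++;
        }
        // Terms are sorted only when the block is written out to disk.
        try (PrintWriter out = new PrintWriter(outFile)) {
            for (Map.Entry<String, List<Integer>> e : new TreeMap<>(dictionary).entrySet()) {
                out.println(e.getKey() + " " + e.getValue());
            }
        }
    }
}

The per-block files produced this way are then merged into the final index, just as for BSBI.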
Distributed indexing
• For web-scale indexing, we must use a distributed
computing cluster
• Individual machines are fault-prone
– Can unpredictably slow down or fail
• Google (2007): estimated total of 1 million servers, 3
million processors/cores
• Estimate: Google installs 100,000 servers each quarter.
– Based on expenditures of 200–250 million dollars per year
– This would be 10% of the computing capacity of the world!?!
Distributed indexing
• Use two classes of parallel tasks
– Parsing
– Inverting
• Input is broken into splits
– each split is a subset of documents
• Master computer assigns a split to an idle machine
– Parser will read a document and sort (t,d) pairs
– Inverter will merge, create and write postings
Distributed indexing: Architecture
• Uses an instance of MapReduce
– general architecture for distributed computing
– simple programming model
– manages interactions among clusters of cheap commodity
servers
• Uses Key-Value pairs as primary object of
computation
• An open-source implementation is “Hadoop” by
apache.org
Distributed indexing: Architecture
[Figure: MapReduce indexing architecture. The master assigns splits to parser machines (map phase). Parsers write (term, docID) pairs into segment files partitioned by term range (a-f, g-p, q-z). In the reduce phase, each inverter collects one term-range partition from all segment files and writes the corresponding postings.]
MapReduce
• Map takes an input key-value pair, produces set of
intermediate key-value pairs
• Reduce takes an intermediate key and a set of
values, produces another set of values
Example: word count

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    Emit(w, "1")

reduce(String key, Iterator values):
  // key: a word
  // values: a list of "1"s
  int result = 0
  for each v in values:
    result += ParseInt(v)
  Emit(result.toString())
Index construction with MapReduce
• Map: documents → (term, docID) pairs
  d1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me."
  d2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"
  → (I:d1, did:d1, …, caesar:d1, …, caesar:d2, …)
• Reduce: (term, docID) pairs → postings lists
  → (caesar:(d1,d2), julius:(d1), …)
Index construction with MapReduce
The same example as on the previous slide, now as pseudocode:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    Emit(w, key)

reduce(String key, Iterator values):
  // key: a word
  // values: a list of docIDs
  Emit(key, values.removeDuplicates())
Exercise
• Web documents contain links to one another
• If docs A,B and C link to D, then D’s list of backlinks is
(A,B,C).
• How can we compute the list of backlinks for each
document using the MapReduce framework?
• Give pseudo-code for the Map and Reduce functions.
Dynamic indexing
• Up to now, we have assumed that collections are
static.
• They rarely are:
– Documents come in over time and need to be inserted.
– Documents are deleted and modified.
• This means that the dictionary and postings lists have
to be modified:
– Postings updates for terms already in dictionary
– New terms added to dictionary
Simplest approach
• Maintain “big” main index
• New docs go into “small” auxiliary index
• Search across both, merge results
• Deletions
– Keep an invalidation bit-vector for deleted docs
– Filter the docs returned by a search through this invalidation bit-vector
• Periodically, re-index into one main index
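A minimal sketch of the scheme, assuming a hypothetical Index interface whose postings(term) returns docIDs; ranking and result merging are ignored.

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class MainPlusAux {
    interface Index { List<Integer> postings(String term); }   // stand-in for a real index

    private final Index main;            // big main index (on disk)
    private final Index aux;             // small auxiliary index for new docs (in memory)
    private final BitSet deleted = new BitSet();   // invalidation bit-vector

    MainPlusAux(Index main, Index aux) { this.main = main; this.aux = aux; }

    void delete(int docID) { deleted.set(docID); }

    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        hits.addAll(main.postings(term));          // search both indexes
        hits.addAll(aux.postings(term));
        hits.removeIf(deleted::get);               // filter out deleted documents
        return hits;
    }
}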
Logarithmic merge
• Maintain a series of indexes, each twice as large as
the previous one.
• Keep smallest (Z0) in memory
• Larger ones (I0, I1, …) on disk
• If Z0 gets too big (> n), write it to disk as I0
– or, if I0 already exists, merge Z0 with I0 into Z1
• Then either write Z1 to disk as I1 (if there is no I1)
– or merge Z1 with I1 to form Z2
• etc.
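A toy in-memory sketch of logarithmic merging. Real implementations keep I0, I1, … on disk and bound the number of postings rather than the number of distinct terms; the names and the threshold N below are illustrative.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class LogMerge {
    static final int N = 4;   // toy capacity of the in-memory index Z0

    // An "index" here is just a sorted term -> postings map.
    private TreeMap<String, List<Integer>> z0 = new TreeMap<>();
    private final List<TreeMap<String, List<Integer>>> disk = new ArrayList<>();  // I0, I1, … (null = empty slot)

    void add(String term, int docID) {
        z0.computeIfAbsent(term, t -> new ArrayList<>()).add(docID);
        if (z0.size() > N) flush();
    }

    private void flush() {
        TreeMap<String, List<Integer>> z = z0;      // Z0 is pushed down the chain
        z0 = new TreeMap<>();
        int i = 0;
        while (true) {
            if (i == disk.size()) { disk.add(z); return; }        // create a new level Ii
            if (disk.get(i) == null) { disk.set(i, z); return; }  // empty slot: store as Ii
            z = merge(disk.get(i), z);              // Ii occupied: merge and carry to Ii+1
            disk.set(i, null);
            i++;
        }
    }

    private static TreeMap<String, List<Integer>> merge(TreeMap<String, List<Integer>> a,
                                                        TreeMap<String, List<Integer>> b) {
        TreeMap<String, List<Integer>> out = new TreeMap<>(a);
        // A real merge would also keep each postings list sorted by docID.
        b.forEach((t, post) -> out.computeIfAbsent(t, k -> new ArrayList<>()).addAll(post));
        return out;
    }
}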
Dynamic indexing at search engines
• All the large search engines now do dynamic
indexing
• Their indices have frequent incremental changes
– News items, blogs, new topical web pages
• Whitney Houston, Håkan Juholt, …
• But (sometimes/typically) they also periodically
reconstruct the index from scratch
– Query processing is switched over to the new index, and the old index is deleted
Why compression (in general)?
• Use less disk space
– Saves a little money
• Keep more stuff in memory
– Increases speed
• Increase speed of data transfer from disk to memory
– [read compressed data | decompress] is faster than
[read uncompressed data]
– Premise: Decompression algorithms are fast
• True of the decompression algorithms we use
Why compression for indexes?
• Dictionary
– Make it small enough to keep in main memory
– Make it so small that you can keep some postings lists in main memory too
• Postings file(s)
– Reduce disk space needed
– Decrease time needed to read postings lists from disk
– Large search engines keep a significant part of the postings in memory.
• Compression lets you keep more in memory
• The book presents various IR-specific compression schemes
Lossless vs lossy compression
• Lossless compression: All information is preserved.
– What we mostly do in IR.
• Lossy compression: Discard some information
• Several of the preprocessing steps can be viewed as
lossy compression: case folding, stop words,
stemming, number elimination.
Vocabulary vs collection size
• How many distinct words are there?
• Heaps’ Law: M = kT^b
  M = vocabulary size, T = number of tokens
• Typically 30 ≤ k ≤ 100 and b ≈ 0.5
– RCV1 corpus used in the book has 38,365 terms on first
1,000,000 tokens, and k=44.
– Last year’s course corpus: 38,220 terms on 684,540 tokens,
k ≈ 46
– However, Swedish Wikipedia has 3,266,333 terms on
185,546,616 tokens, k ≈ 237
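A quick check of these constants, taking b = 0.5 so that k ≈ M / √T:
  Last year's course corpus: 38,220 / √684,540 ≈ 38,220 / 827 ≈ 46
  Swedish Wikipedia: 3,266,333 / √185,546,616 ≈ 3,266,333 / 13,622 ≈ 240 (close to the quoted 237; the exact value depends on the fitted b)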
Swedish Wikipedia
[Figure: vocabulary growth for Swedish Wikipedia: number of distinct terms plotted against number of tokens]
Distribution of terms
• Zipf’s law describes the relative frequency of terms:
  cf_i = K / i
  where cf_i is the collection frequency (the number of occurrences of term t_i in the collection), i is the frequency rank of t_i, and K is a constant
• Consequence: If the most frequent term (the) occurs
cf1 times
– then the second most frequent term (of) occurs cf1/2 times
– the third most frequent term (and) occurs cf1/3 times …
Swedish Wikipedia
• The most frequent terms and their collection frequencies:
i 3648969
quot 3460170
och 2878013
lt 2411738
gt 2365764
a 2344229
en 2169456
av 1718654
kategori 1613462
som 1518518
är 1232640
att 1215843
på 1174394
ö 1138493
de 1127375
den 1071467
till 1052663
med 1029549
för 975414
ref 842771
Swedish Wikipedia
• Top 30 terms account for 41,238,989 tokens
– about 22% of all tokens
Dictionary compression
• Why compress the dictionary?
– Search begins with the dictionary
– We want to keep it in memory
– Memory footprint competition with other applications
– Embedded/mobile devices may have very little memory
– Even if the dictionary isn’t in memory, we want it to be
small for a fast search startup time
– So, compressing the dictionary is important
First idea
  Term      Freq.      Postings ptr.
  a         656,265        →
  aachen    65             →
  ….        ….             ….
  zulu      221            →

  (20 bytes for the term, 4 bytes each for freq. and the postings pointer)

• Array of fixed-width entries
– ~400,000 terms; 28 bytes/term = 11.2 MB
Compressing the term list
….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
• Store the dictionary as one long string of characters (as above)
– The pointer to the next word marks the end of the current word
– Hope to save up to 60% of dictionary space

  Freq.   Postings ptr.   Term ptr.
  33          →               →
  29          →               →
  44          →               →
  126         →               →

• Total string length = 400K x 8 B = 3.2 MB
• Term pointers must resolve 3.2M positions: log2(3.2M) ≈ 22 bits = 3 bytes
Blocking
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
• Store term pointers only for every kth term string (in the string above: k = 4)
• Term lengths must now be stored in the string (1 extra byte per term, shown as the digits above)

  Freq.   Postings ptr.   Term ptr.
  33          →               →
  29          →
  44          →
  126         →

• Save 9 bytes on 3 term pointers; lose 4 bytes on term lengths
Front coding
• Front coding:
– Consecutive words in sorted order commonly share a long prefix
– Store only the differences (for the last k-1 terms in a block of k)
8automata8automate9automatic10automation→8automat*a1◊e2◊ic3◊ion
  ("automat*" encodes the shared prefix automat; each number that follows gives a term's extra length beyond automat, and ◊ marks the start of its suffix)
Begins to resemble general string compression.
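A small Java sketch of one common front-coding variant, in which each term stores the length of the prefix it shares with the previous term; the notation differs slightly from the ◊ example above, but the idea is the same.

import java.util.List;

public class FrontCoding {
    // Encode a block of k sorted terms: the first term is stored in full,
    // each later term as (shared-prefix length, '*', differing suffix).
    static String encodeBlock(List<String> terms) {
        StringBuilder sb = new StringBuilder();
        String prev = "";
        for (String t : terms) {
            int p = 0;
            while (p < prev.length() && p < t.length() && prev.charAt(p) == t.charAt(p)) p++;
            if (prev.isEmpty()) {
                sb.append(t.length()).append(t);                  // first term in full
            } else {
                sb.append(p).append('*').append(t.substring(p));  // prefix length + suffix
            }
            prev = t;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints 8automata7*e7*ic8*on under this sketch's conventions
        System.out.println(encodeBlock(List.of("automata", "automate", "automatic", "automation")));
    }
}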
Compressions on the corpus in the book
  Technique                                            Size in MB
  Fixed width                                          11.2
  Dictionary-as-string with pointers to every term      7.6
  Also, blocking with k = 4                             7.1
  Also, blocking + front coding                         5.9
Compressing postings
• We store the list of docs containing a term in
increasing order of docID.
– computer: 33,47,154,159,202 …
• Consequence: it suffices to store gaps.
– 33,14,107,5,43 …
• Hope: most gaps can be encoded/stored with far fewer bits than the full docIDs would need.
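A minimal sketch of the gap transformation (the first "gap" is simply the first docID):

import java.util.Arrays;

public class Gaps {
    // Convert a docID list, sorted in increasing order, into gaps.
    static int[] toGaps(int[] docIDs) {
        int[] gaps = new int[docIDs.length];
        int prev = 0;
        for (int i = 0; i < docIDs.length; i++) {
            gaps[i] = docIDs[i] - prev;
            prev = docIDs[i];
        }
        return gaps;
    }

    public static void main(String[] args) {
        // 33, 47, 154, 159, 202  ->  33, 14, 107, 5, 43
        System.out.println(Arrays.toString(toGaps(new int[] {33, 47, 154, 159, 202})));
    }
}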
Three postings entries
[Figure: example postings lists shown both as full docIDs and as gaps]
Variable-length encoding
• Aim:
– For arachnocentric, we will use ~20 bits/gap entry.
– For the, we will use ~1 bit/gap entry.
• If the average gap for a term is G, we want to use
~log2G bits/gap entry.
• Key challenge: encode every integer (gap) with about
as few bits as needed for that integer.
• This requires a variable length encoding
• Variable length codes achieve this by using short
codes for small numbers
Variable-byte codes
• Begin with one byte to store G and dedicate 1 bit in it
to be a continuation bit c
• If G ≤127, binary-encode it in the 7 available bits and
set c =1
• Else encode G’s lower-order 7 bits and then use
additional bytes to encode the higher order bits
using the same algorithm
• At the end set the continuation bit of the last byte to
1 (c =1) – and for the other bytes c = 0.
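A Java sketch of the variable-byte scheme just described; bytes are kept as small ints (0–255) so the bit patterns are easy to inspect.

import java.util.ArrayList;
import java.util.List;

public class VByte {
    // Encode one gap: 7-bit groups, most significant group first;
    // the continuation bit (high bit) is set to 1 only on the last byte.
    static List<Integer> encode(int gap) {
        List<Integer> bytes = new ArrayList<>();
        while (true) {
            bytes.add(0, gap % 128);          // prepend the lowest 7 bits
            if (gap < 128) break;
            gap /= 128;
        }
        int last = bytes.size() - 1;
        bytes.set(last, bytes.get(last) + 128);   // set c = 1 on the last byte
        return bytes;
    }

    // Decode a whole byte stream back into a list of gaps.
    static List<Integer> decode(List<Integer> bytes) {
        List<Integer> gaps = new ArrayList<>();
        int n = 0;
        for (int b : bytes) {
            if (b < 128) {                    // c = 0: more bytes follow for this gap
                n = 128 * n + b;
            } else {                          // c = 1: last byte of this gap
                gaps.add(128 * n + (b - 128));
                n = 0;
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Gaps from the example on the next slide: 824, 5, 214577
        List<Integer> stream = new ArrayList<>();
        for (int g : new int[] {824, 5, 214577}) stream.addAll(encode(g));
        System.out.println(stream);           // the VB-coded byte values
        System.out.println(decode(stream));   // [824, 5, 214577]
    }
}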
Example
  docIDs     824                 829         215406
  gaps                           5           214577
  VB code    00000110 10111000   10000101    00001101 00001100 10110001

• Postings stored as the byte concatenation
  000001101011100010000101000011010000110010110001