TRANSCRIPT
Comparing Distributed Indexing: To MapReduce or Not?
Richard McCreadie
Craig Macdonald
Iadh Ounis
Talk Outline
1. Motivations
2. Classical Indexing
   – Single-Pass Indexing
   – Shared-Nothing / Shared-Corpus Indexing
3. MapReduce
   – What is MapReduce?
   – Indexing Strategies for MapReduce
4. Experimentation and Results
   – Measures and Environment
   – Using Shared-Corpus Indexing as a baseline
   – Comparing MapReduce Indexing Techniques
5. Conclusions
MOTIVATIONS
1. Why is Efficient Indexing Important?
2. Contributions
Why is Efficient Indexing Important?
• Indexing is an essential part of any IR system
• But test corpora have grown exponentially
• This has reinvigorated the need for efficient indexing
Collection   Data    Year   Docs   Size (GB)
WT2G         Web     1999   240k   2.0
GOV          Web     2002   1.8M   18.0
Blogs06      Blogs   2006   3M     13.0
GOV2         Web     2004   25M    425.0
ClueWeb09    Web     2009   1.2B   25,000
• Commercial organisations and research groups alike have embraced scale-out approaches
• MapReduce has been widely adopted by the commercial search engines
• However, there have been no studies into its suitability for indexing
Solutions?
MapReduce
Contributions
• We examine the benefits to be gained from indexing with MapReduce
• We discuss 4 indexing techniques in MapReduce
  – 3 existing techniques
  – 1 novel technique inspired by single-pass indexing
• We then compare them to a shared-corpus distributed indexing strategy
CLASSICAL INDEXING
1. Classical Indexing
2. Single-Pass Indexing
3. Shared-Nothing & Shared-Corpus Distributed Indexing
Classical Indexing
[Diagram: a lexicon entry (term, total docs, total frequency, pointer) pointing into a posting list of (document number, frequency) pairs]
Need to build two important structures:
• Inverted Index: posting lists for each term: <docid, frequency>
• Lexicon: term information and pointer to correct posting list
• State-of-the-art indexing uses a single-pass strategy.
(I.H. Witten, A. Moffat and T.C. Bell, 1999)
1. Parse collection
2. Build postings in memory
3. Merge inverted indices
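The two structures above can be sketched in a few lines of Python. This is a toy in-memory illustration, not the Terrier implementation; the "pointer" is just the term key, where on disk it would be a byte offset into the posting-list file:

```python
from collections import Counter

def build_index(docs):
    """Build a toy inverted index and lexicon from {docid: text}.

    Inverted index: term -> posting list of (docid, frequency) pairs.
    Lexicon: term -> (document frequency, total term frequency, pointer).
    """
    inverted = {}
    for docid, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            inverted.setdefault(term, []).append((docid, tf))
    lexicon = {t: (len(p), sum(tf for _, tf in p), t)
               for t, p in inverted.items()}
    return inverted, lexicon

docs = {1: "to be or not to be", 2: "to index or not to index"}
inverted, lexicon = build_index(docs)
```

For example, `inverted["to"]` is `[(1, 2), (2, 2)]`: the term occurs twice in each document, and its lexicon entry records 2 documents and a total frequency of 4.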
Single-Pass In-Memory Indexing
[Diagram: per-term posting lists (t1, t2, t3) accumulate in RAM; as memory fills, the indexer flushes them to disk as compressed files, which are merged into the final inverted index]
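The flush-and-merge cycle above can be sketched as follows. This is a minimal model, not the Witten, Moffat & Bell implementation: the memory budget is simulated by a posting count, and the compressed on-disk runs are just sorted in-memory lists:

```python
from collections import Counter
import heapq

def single_pass_index(docs, max_postings=4):
    """Single-pass in-memory indexing over (docid, text) pairs:
    accumulate postings in RAM, flush a sorted run when the (simulated)
    memory budget is hit, and merge all runs at the end."""
    runs, current, n = [], {}, 0
    for docid, text in docs:
        for term, tf in Counter(text.split()).items():
            current.setdefault(term, []).append((docid, tf))
            n += 1
        if n >= max_postings:            # memory nearly full: flush a run
            runs.append(sorted(current.items()))
            current, n = {}, 0
    if current:
        runs.append(sorted(current.items()))
    # Merge the sorted runs into the final inverted index.
    final = {}
    for term, postings in heapq.merge(*runs):
        final.setdefault(term, []).extend(postings)
    return final
```

Because each run is written in term order, the final merge is a cheap k-way merge rather than a full re-sort.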
How can Indexing be Distributed?
• Two classical approaches
– Shared-Nothing
  • Indexer on every machine
  • Index local data
  • Optimal!
– Shared-Corpus
  • Indexer on every machine
  • Index remote data over NFS
Problems with Classical Approaches
• Shared-Nothing
  – No fault tolerance
  – Only a single copy of data
  – Jobs have to be manually administered
  – Data may not be available locally
• Shared-Corpus
  – No fault tolerance
  – Single point of failure (server)
  – Constrained by network bandwidth and server speed
  – Jobs have to be manually administered
• MapReduce reportedly solves these issues
MAPREDUCE INDEXING
1. MapReduce
2. MapReduce Indexing Strategies
MapReduce
• Programming paradigm introduced by Google
• Splits jobs into map and reduce operations
1. Map: apply the map function (here, indexing) over each entry in the input
2. Sort: sort the map output locally
3. Reduce: merge the map output
• Provides:
  – Convenient programming API
  – Automatic job control
  – Fault tolerance
  – Distributed data storage (DFS)
  – Data replication
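The three-step data flow above can be modelled in a few lines of Python. This is an in-memory illustration only; a real framework wraps exactly this skeleton with the DFS, scheduling, and fault tolerance:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy MapReduce: map over every record, sort the intermediate
    (key, value) pairs by key, then reduce each group of values."""
    intermediate = [pair for record in inputs for pair in map_fn(record)]
    intermediate.sort(key=itemgetter(0))              # the "sort" phase
    return {k: reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))}
```

The canonical usage example is word counting: the map emits `(word, 1)` pairs and the reduce sums them.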
Indexing with MapReduce
• Implies that indexing is trivial . . .
• Could a more complex approach perform better?
• We try multiple approaches:
“Map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document ID’s and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.”
Dean & Ghemawat, MapReduce: Simplified data processing on large clusters. OSDI 2004
Approach      Emits           Sorting   Num emits per map   Emit size
D&G_Token     Tokens          Lots      Lots                Tiny
D&G_Term      Terms           Lots      Many                Tiny
Nutch         Documents       Little    Some                Average
Single-Pass   Posting lists   Some      Few                 Large
Emitting each word in the document means a big intermediate sort and lots of merging
D&G_Token & D&G_Term
• D&G_Token
  – Map: for each token in the document,
        emit (token, document-ID)
  – Sort: by token and document-ID
  – Reduce: for each unique token (term),
        sum repeated document-IDs to get tf values
        write the posting list for that term
• D&G_Term
  – Map: for each term in the document,
        emit (term, document-ID, tf)
  – Sort: by term and document-ID
  – Reduce: for each term,
        write the posting list for that term
• Based on Dean & Ghemawat’s MapReduce Paper (OSDI 2004)
Emit every token
Emit every term
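The two strategies above can be sketched side by side. These are in-memory simulations of the map/sort/reduce pipeline, not Hadoop code; note how D&G_Term's per-document pre-aggregation in the map shrinks the intermediate data that must be sorted:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def dg_token(docs):
    """D&G_Token: emit one (token, docid) pair per token occurrence."""
    pairs = [(tok, docid) for docid, text in docs for tok in text.split()]
    pairs.sort()                           # sort by token, then docid
    index = {}
    for term, group in groupby(pairs, key=itemgetter(0)):
        # Reduce: count repeated docids to recover term frequencies.
        index[term] = sorted(Counter(d for _, d in group).items())
    return index

def dg_term(docs):
    """D&G_Term: emit one (term, docid, tf) triple per unique term."""
    pairs = [(term, docid, tf) for docid, text in docs
             for term, tf in Counter(text.split()).items()]
    pairs.sort()                           # far fewer pairs to sort
    return {term: [(d, tf) for _, d, tf in group]
            for term, group in groupby(pairs, key=itemgetter(0))}
```

Both produce identical posting lists; the difference is purely in how much intermediate data is emitted and sorted.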
Nutch Style Indexing
• Nutch (Lucene) provides MapReduce indexing
• We investigated v0.9
– Map: for each document,
      analyse the document
      emit (document-ID, analysed-document)
– Sort: by document-ID (URL)
– Reduce: for each document-ID,
      build the Nutch document
      index the Nutch document
• Approximately equivalent to using a null mapper
• Emits less than the D&G approaches
• But we believe we can do better . . .
Emit every document
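The document-level pipeline above can be sketched as follows. This is a toy simulation, not Nutch/Lucene code: "analysis" is reduced to tokenisation, and the reduce phase does all the index construction:

```python
from collections import Counter

def nutch_style(docs):
    """Nutch-style indexing: the map only analyses each document and
    emits it whole, keyed by docid; indexing happens in the reduce."""
    # Map: analyse (tokenise) each document, emit (docid, analysed-doc).
    analysed = [(docid, text.split()) for docid, text in docs]
    analysed.sort(key=lambda p: p[0])      # Sort: by document ID
    # Reduce: index each analysed document into the inverted index.
    index = {}
    for docid, tokens in analysed:
        for term, tf in Counter(tokens).items():
            index.setdefault(term, []).append((docid, tf))
    return index
```

One emit per document keeps the intermediate sort small, but each emit carries the whole analysed document.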
Our Single-Pass Indexing Strategy
• We propose a novel adaptation of Single-Pass indexing
• Idea:
  – Use local machine memory
  – Build useful compressed structures
  – Emit less, therefore less IO and less sorting
• Single-Pass Indexing
  – Map: for each document,
        add the document to a compressed in-memory partial index
        if memory is nearly full, flush:
            for each term in the partial index,
                emit (term, partial posting list)
Emit limited Posting-Lists
Our MapReduce Indexing Strategy (2)
– Sort: by map, flush, and term
– Reduce: for each term,
      merge the partial posting lists
      write out the merged posting list
• Maps only emit compressed posting lists (less IO)
• Few emits, so sorting is easy
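The map/flush/reduce flow just described can be sketched as follows. This is a simplified in-memory model, not the Terrier implementation: compression is omitted, the memory budget is simulated by a document count, and the flush-order sort is provided by Python's stable sort:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def single_pass_mr(docs, flush_every=2):
    """Single-pass MapReduce indexing: the map accumulates a partial
    index in memory and, on flush, emits one (term, partial posting
    list) pair per term; the reduce concatenates the runs per term."""
    emitted = []                 # (term, partial posting list) pairs
    partial, seen = {}, 0
    for docid, text in docs:
        for term, tf in Counter(text.split()).items():
            partial.setdefault(term, []).append((docid, tf))
        seen += 1
        if seen % flush_every == 0:       # memory nearly full: flush
            emitted += list(partial.items())
            partial = {}
    emitted += list(partial.items())      # final flush
    # Sort by term; the stable sort keeps flush order within a term.
    emitted.sort(key=itemgetter(0))
    # Reduce: merge the partial posting lists for each term.
    return {term: [p for _, plist in group for p in plist]
            for term, group in groupby(emitted, key=itemgetter(0))}
```

Each emitted value is a whole posting-list fragment, so there are far fewer, larger records to sort than one pair per token or term.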
• We now evaluate these indexing strategies . . .
EXPERIMENTATION & RESULTS
1. Measures and Setup
2. Indexing Throughput and Scaling
Research Questions
• Recall, we want to evaluate MapReduce for indexing large-scale collections
• Questions:
  – Is Shared-Corpus indexing sufficient for large-scale collections? (baseline)
  – Can MapReduce perform close to Shared-Nothing indexing? (optimal)
Evaluation of MapReduce Indexing
• Two measures are used
• Measure speed with throughput:

    throughput = (compressed) collection size / total time taken to index

• Measure scaling with speedup:

    speedup = total time taken to index on 1 machine / total time taken to index using m machines

• Optimal speedup would be where m machines are m times faster – known as linear speedup
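The two measures translate directly into code; the values used below are illustrative, not results from the talk:

```python
def throughput(collection_mb, total_seconds):
    """Indexing speed: (compressed) collection size over total time."""
    return collection_mb / total_seconds          # MB/sec

def speedup(time_one_machine, time_m_machines):
    """Scaling: time on 1 machine over time on m machines.
    A value of m on m machines is linear (optimal) speedup."""
    return time_one_machine / time_m_machines
```

For example, a job that takes 24 hours on one machine and 3 hours on 8 machines achieves a speedup of 8, i.e. exactly linear.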
Experimental Setup
• All indexing strategies are implemented in Hadoop
  – The only fully open-source MapReduce implementation
  – An Apache Software Foundation project
  – Reportedly used for indexing by a major web search engine
• Cluster Setup
  – 4 (3) cores, 2.4GHz, 4GB RAM, ~1TB of HDD per machine
  – Single gigabit Ethernet rack
  – Hadoop v0.18.2 (HDFS)
  – Hadoop on Demand (HOD)
  – Torque Resource Manager (v2.1.9)
  – Also a RAID5 file server with 8 3GHz cores
• Use the TREC .GOV2 corpus
  – 25 million documents, 425GB uncompressed
Target Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines
• Terrier single-pass indexing has a throughput of 1 MB/sec for a single core
• We project Shared-Nothing as being optimal (linear speedup)
Baseline Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass
• Use shared-corpus indexing as a baseline
• No improvement after 4 machines
• The file server acts as a bottleneck
• Can't scale file servers indefinitely
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
D&G_Token Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token              -      -      -      -      -
MapReduce D&G_Term
MapReduce Single-Pass
• No results
• Runs failed: too much map output caused the disks to run out of space
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
D&G_Term Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token              -      -      -      -      -
MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
MapReduce Single-Pass
• Indexing was possible
• But scaling was strongly sub-linear
• Still too much emitting
• Half the speed of the Shared-Corpus approach
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
Single-Pass Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token              -      -      -      -      -
MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
MapReduce Single-Pass         2.59   5.19   9.45  13.16  17.31
• Faster than both the D&G_Term and Shared-Corpus approaches
• Still scales sub-linearly
• Data locality is the main concern
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
CONCLUSION
Conclusions
• Shared-Corpus indexing is limited by the speed of the central file store
• Performing efficient indexing in MapReduce is not trivial
  – Indeed, all indexing strategies implemented here scale sub-linearly
• Single-pass indexing is only marginally sub-linear
  – A small price to pay for the advantages?
• Using single-pass indexing, MapReduce is suitable for indexing the current generation of web corpora
  – We indexed the 'B' set of ClueWeb09 in just under 20 hours using 5 machines (15 cores)
  – It has been reported on the TREC Web mailing list that, using Indri with Amazon EC2 (4 cores) and S3, indexing takes 31 hours with a Shared-Nothing approach
    . . . plus 10 days of upload time!
Questions?
http://terrier.org