
Page 1: Comparing Distributed Indexing: To MapReduce or Not?

Comparing Distributed Indexing: To MapReduce or Not?

Richard McCreadie
Craig Macdonald
Iadh Ounis

Page 2

Talk Outline

1. Motivations
2. Classical Indexing
   – Single-Pass Indexing
   – Shared-Nothing/Shared-Corpus Indexing
3. MapReduce
   – What is MapReduce?
   – Indexing Strategies for MapReduce
4. Experimentation and Results
   – Measures and Environment
   – Using Shared-Corpus Indexing as a baseline
   – Comparing MapReduce Indexing Techniques
5. Conclusions

Page 3

MOTIVATIONS

1. Why is Efficient Indexing Important?
2. Contributions

Page 4

Why is Efficient Indexing Important?

• Indexing is an essential part of any IR system
• But test corpora have grown exponentially
• This has reinvigorated the need for efficient indexing

Collection   Data   Year   Docs   Size (GB)
WT2G         Web    1999   240k       2.0
GOV          Web    2002   1.8M      18.0
Blogs06      Blogs  2006   3M        13.0
GOV2         Web    2004   25M      425.0
ClueWeb09    Web    2009   1.2B   25,000

Page 5

Solutions? MapReduce

• Commercial organisations and research groups alike have embraced scale-out approaches
• MapReduce has been widely adopted by the commercial search engines
• But there have been no studies into its suitability for indexing

Page 6

Contributions

• We examine the benefits to be gained from indexing with MapReduce
• We discuss 4 indexing techniques in MapReduce
  – 3 existing techniques
  – 1 novel technique inspired by single-pass indexing
• We then compare them to a shared-corpus distributed indexing strategy

Page 7

CLASSICAL INDEXING

1. Classical Indexing
2. Single-Pass Indexing
3. Shared-Nothing & Shared-Corpus Distributed Indexing

Page 8

Classical Indexing

[Diagram: a Lexicon entry (term, total docs, total frequency, pointer) points into the Posting List for that term, a sequence of (document number, frequency) pairs]

Need to build two important structures:

• Inverted Index: posting lists for each term: <docid, frequency>
• Lexicon: term information and a pointer to the correct posting list

• State-of-the-art indexing uses a single-pass strategy (I.H. Witten, A. Moffat and T.C. Bell, 1999)
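As a toy illustration of these two structures (an in-memory sketch for clarity, not the compressed on-disk layout an IR system such as Terrier actually uses), the lexicon and inverted index can be built together:

```python
from collections import defaultdict

def build_index(docs):
    """Build a toy inverted index and lexicon from {docid: text}."""
    postings = defaultdict(list)              # term -> [(docid, frequency), ...]
    for docid, text in sorted(docs.items()):
        counts = defaultdict(int)
        for token in text.lower().split():
            counts[token] += 1
        for term, tf in counts.items():
            postings[term].append((docid, tf))
    # Lexicon entry per term: (total docs, total frequency, "pointer" to its list);
    # here the pointer is just the term itself rather than a file offset.
    lexicon = {t: (len(pl), sum(tf for _, tf in pl), t)
               for t, pl in postings.items()}
    return lexicon, postings

lex, inv = build_index({1: "to index or not", 2: "index the index"})
# inv["index"] == [(1, 1), (2, 2)]; lex["index"][:2] == (2, 3)
```

In a real indexer the posting lists live on disk in compressed form, and the lexicon's pointer is a byte offset into that file.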

Page 9

Single-Pass In-Memory Indexing

1. Parse collection
2. Build postings in memory
3. Merge inverted indices

[Diagram: the indexer builds posting lists for terms t1, t2, t3 in RAM; as the percentage of memory used grows, compressed files are flushed to disk and merged into the final inverted index]
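The three steps above can be sketched in miniature (an illustrative toy under simplifying assumptions: runs stay in memory rather than being compressed to disk, and "memory near full" is a posting counter):

```python
from collections import defaultdict
import heapq

def single_pass_index(docs, max_postings=4):
    """Single-pass indexing sketch: buffer postings in memory, flush a
    sorted run whenever the buffer fills, then merge the runs."""
    runs, buffer, n = [], defaultdict(list), 0

    def flush():
        nonlocal buffer, n
        runs.append(sorted(buffer.items()))   # one sorted "on-disk" run
        buffer, n = defaultdict(list), 0

    for docid, text in docs:                  # 1. parse collection
        counts = defaultdict(int)
        for tok in text.split():
            counts[tok] += 1
        for term, tf in counts.items():       # 2. build postings in memory
            buffer[term].append((docid, tf))
            n += 1
        if n >= max_postings:                 # "memory near full"
            flush()
    if buffer:
        flush()

    merged = defaultdict(list)                # 3. merge the runs term-by-term
    for term, plist in heapq.merge(*runs):
        merged[term].extend(plist)
    return dict(merged)

idx = single_pass_index([(1, "a b a"), (2, "b c"), (3, "a c")])
# idx["a"] == [(1, 2), (3, 1)]
```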

Page 10

How can Indexing be Distributed?

• Two classical approaches:
  – Shared-Nothing
    • Indexer on every machine
    • Index local data
    • Optimal!
  – Shared-Corpus
    • Indexer on every machine
    • Index remote data over NFS

Page 11

Problems with Classical Approaches

• Shared-Nothing
  – No fault tolerance
  – Only a single copy of the data
  – Jobs have to be manually administered
  – Data may not be available locally

• Shared-Corpus
  – No fault tolerance
  – Single point of failure (the server)
  – Constrained by network bandwidth and server speed
  – Jobs have to be manually administered

• MapReduce reportedly solves these issues

Page 12

MAPREDUCE INDEXING

1. MapReduce
2. MapReduce Indexing Strategies

Page 13

MapReduce

• Programming paradigm introduced by Google
• Splits jobs into map and reduce operations:
  1. Map: apply the map function (indexing) over each entry in the input
  2. Sort the map output locally
  3. Reduce: merge the map output

• Provides:
  – Convenient programming API
  – Automatic job control
  – Fault tolerance
  – Distributed data storage (DFS)
  – Data replication
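The map/sort/reduce flow can be simulated in a few lines (an in-memory toy, not a real distributed framework; the word-count job below is the usual illustrative example, with hypothetical mapper and reducer names):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Minimal in-memory sketch of the three MapReduce phases: map over
    every record, sort the emitted pairs by key, reduce each key's group."""
    emitted = [pair for rec in records for pair in mapper(rec)]   # 1. Map
    emitted.sort(key=itemgetter(0))                               # 2. Sort
    return {k: reducer(k, [v for _, v in grp])                    # 3. Reduce
            for k, grp in groupby(emitted, key=itemgetter(0))}

# Toy word-count job over two "documents"
counts = map_reduce(["to map reduce", "map or not"],
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda k, vs: sum(vs))
# counts["map"] == 2
```

A real framework additionally partitions the sorted output across reducer machines and re-runs failed map or reduce tasks, which is what provides the fault tolerance listed above.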

Page 14

Indexing with MapReduce

“Map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document ID’s and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.”
(Dean & Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004)

• This implies that indexing is trivial . . .
• Could a more complex approach perform better?
• We try multiple approaches:

Approach      Emits           Sorting   Num emits per map   Emit size
D&G_Token     Tokens          Lots      Lots                Tiny
D&G_Term      Terms           Lots      Many                Tiny
Nutch         Documents       Little    Some                Average
Single-Pass   Posting lists   Some      Few                 Large

[Slide callouts: emitting each word in the document means a big intermediate sort and lots of merging]

Page 15

D&G_Token & D&G_Term

• D&G_Token (emit every token)
  – Map: for each token in the document,
      emit (token, document-ID)
  – Sort: by token and document-ID
  – Reduce: for each unique token (term),
      sum repeated document-IDs to get tfs,
      write the posting list for that term

• D&G_Term (emit every term)
  – Map: for each term in the document,
      emit (term, document-ID, tf)
  – Sort: by term and document-ID
  – Reduce: for each term,
      write the posting list for that term

• Based on Dean & Ghemawat’s MapReduce paper (OSDI 2004)
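A toy sketch of the D&G_Term variant (function names and tuple layouts are illustrative assumptions, not the authors' code): the map emits one pair per unique term with its tf already counted, and the reduce groups the sorted pairs into posting lists.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def dg_term_map(docid, text):
    """Map sketch: emit one (term, (docid, tf)) pair per unique term."""
    tf = defaultdict(int)
    for tok in text.split():
        tf[tok] += 1
    return [(term, (docid, f)) for term, f in tf.items()]

def dg_term_reduce(pairs):
    """Sort + reduce sketch: collect each term's (docid, tf) pairs
    into a document-ordered posting list."""
    pairs.sort(key=itemgetter(0))
    return {term: sorted(v for _, v in grp)
            for term, grp in groupby(pairs, key=itemgetter(0))}

pairs = dg_term_map(1, "b a b") + dg_term_map(2, "a c")
index = dg_term_reduce(pairs)
# index["a"] == [(1, 1), (2, 1)]; index["b"] == [(1, 2)]
```

D&G_Token differs only in emitting one pair per token occurrence (no tf), so its reduce must also sum the repeated document-IDs, hence many more, smaller emits.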

Page 16

Nutch Style Indexing

• Nutch (Lucene) provides MapReduce indexing
• We investigated v0.9 (emits every document)
  – Map: for each document,
      analyse the document,
      emit (document-ID, analysed-document)
  – Sort: by document-ID (URL)
  – Reduce: for each document-ID,
      build the Nutch document,
      index the Nutch document

• Approximately equivalent to using a null-mapper
• Emits less than the D&G approaches
• But we believe we can do better . . .

Page 17

Our Single-Pass Indexing Strategy

• We propose a novel adaptation of single-pass indexing

• Idea:
  – Use local machine memory
  – Build useful compressed structures
  – Emit less, therefore less IO and less sorting

• Single-Pass Indexing (emits a limited number of posting lists)
  – Map: for each document,
      add the document to a compressed in-memory partial index;
      if memory is near full, flush:
        for each term in the partial index,
          emit (term, partial posting list)

Page 18

Our MapReduce Indexing Strategy (2)

  – Sort: by map number, flush number, and term
  – Reduce: for each term,
      merge the partial posting lists,
      write out the merged posting list

• Maps only emit compressed posting lists (less IO)
• Few emits, so sorting is easy
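The reduce side can be sketched as follows (the record layout, including the (map, flush, term) key, is an assumption for illustration; the real implementation merges compressed posting lists): because sorting by map and flush number preserves document order, merging a term's partial lists is plain concatenation.

```python
from collections import defaultdict

def merge_partial_lists(emits):
    """Reduce sketch: each partial posting list arrives keyed by the map
    task and flush number that produced it; sorting on (map, flush, term)
    keeps postings in document order, so concatenation suffices."""
    merged = defaultdict(list)
    for (map_id, flush_id, term), partial in sorted(emits):
        merged[term].extend(partial)
    return dict(merged)

emits = [
    ((0, 1, "a"), [(3, 1)]),           # map 0, second flush
    ((0, 0, "a"), [(1, 2), (2, 1)]),   # map 0, first flush
    ((0, 0, "b"), [(1, 1)]),
]
index = merge_partial_lists(emits)
# index["a"] == [(1, 2), (2, 1), (3, 1)]
```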

• We now evaluate these indexing strategies . . .

Page 19

EXPERIMENTATION & RESULTS

1. Measures and Setup
2. Indexing Throughput and Scaling

Page 20

Research Questions

• Recall, we want to evaluate MapReduce for indexing large-scale collections

• Questions:
  – Is Shared-Corpus indexing sufficient for large-scale collections? (baseline)
  – Can MapReduce perform close to Shared-Nothing indexing? (optimal)

Page 21

Evaluation of MapReduce Indexing

• Two measures are used:
  – Speed, measured with throughput:
      throughput = (compressed) collection size / total time taken to index
  – Scaling, measured with speedup:
      speedup = total time to index on 1 machine / total time to index using m machines

• Optimal speedup would be where m machines are m times faster, known as linear speedup
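The two measures translate directly into code (the timings in the example call are hypothetical; the 425,000 MB figure is .GOV2's uncompressed size from the setup slide):

```python
def throughput_mb_per_sec(collection_size_mb, total_seconds):
    """Throughput = (compressed) collection size / total time taken to index."""
    return collection_size_mb / total_seconds

def speedup(time_on_1_machine, time_on_m_machines):
    """Speedup = time to index on 1 machine / time to index on m machines.
    Linear (optimal) speedup on m machines equals m."""
    return time_on_1_machine / time_on_m_machines

# Hypothetical run: indexing 425,000 MB in 40 hours on 1 machine
# versus 5 hours on 8 machines would be exactly linear speedup.
print(speedup(40.0, 5.0))                      # → 8.0
print(throughput_mb_per_sec(425000, 425000))   # → 1.0 MB/sec
```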

Page 22

Experimental Setup

• All indexing strategies are implemented in Hadoop
  – The only fully open-source MapReduce implementation
  – An Apache Software Foundation project
  – Reportedly used for production indexing

• Cluster setup:
  – 4 (3) cores, 2.4GHz, 4GB RAM, ~1TB of HDD
  – Single gigabit Ethernet rack
  – Hadoop v0.18.2 (HDFS)
  – Hadoop on Demand (HOD)
  – Torque Resource Manager (v2.1.9)
  – Also a RAID5 file server with 8 3GHz cores

• We use the TREC .GOV2 corpus
  – 25 million documents, 425GB uncompressed

Page 23

Target Indexing Throughput

                               Number of Machines Allocated
Indexing Strategy               1      2      4      6      8
Shared-Nothing Distributed      3      6     12     18     24
Shared-Corpus Distributed
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

• Terrier single-pass indexing has a throughput of 1MB/sec for a single core
• We project Shared-Nothing as being optimal (linear speedup)

Page 24

Baseline Indexing Throughput

                               Number of Machines Allocated
Indexing Strategy               1      2      4      6      8
Shared-Nothing Distributed      3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

• We use Shared-Corpus indexing as a baseline
• No improvement after 4 machines
• The file server acts as a bottleneck
• File servers can't be scaled indefinitely

Page 25

D&G_Token Indexing Throughput

                               Number of Machines Allocated
Indexing Strategy               1      2      4      6      8
Shared-Nothing Distributed      3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token             -      -      -      -      -
MapReduce D&G_Term
MapReduce Single-Pass

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

• No results: runs failed because there was too much map output, and the disks ran out of space

Page 26

D&G_Term Indexing Throughput

                               Number of Machines Allocated
Indexing Strategy               1      2      4      6      8
Shared-Nothing Distributed      3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token             -      -      -      -      -
MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
MapReduce Single-Pass

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

• Indexing was possible
• But scaling was strongly sub-linear
• Still too much emitting
• Half the speed of the Shared-Corpus approach

Page 27

Single-Pass Indexing Throughput

                               Number of Machines Allocated
Indexing Strategy               1      2      4      6      8
Shared-Nothing Distributed      3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token             -      -      -      -      -
MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
MapReduce Single-Pass         2.59   5.19   9.45  13.16  17.31

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines

• Faster than both the D&G_Term and Shared-Corpus approaches
• Still scales sub-linearly
• Data locality is the main concern

Page 28

CONCLUSION

Page 29

Conclusions

• Shared-Corpus indexing is limited by the speed of the central file store

• Performing efficient indexing in MapReduce is not trivial
  – Indeed, all indexing strategies implemented here scale sub-linearly

• Single-pass indexing is only marginally sub-linear
  – A small price to pay for the advantages?

• Using single-pass indexing, MapReduce is suitable for indexing the current generation of web corpora
  – We indexed the ‘B’ set of ClueWeb09 in just under 20 hours using 5 machines (15 cores)
  – It has been reported on the TREC Web mailing list that, using Indri with a Shared-Nothing approach on Amazon EC2 (4 cores) and S3, indexing takes 31 hours . . . plus 10 days of upload time!

Page 30

Questions?

http://terrier.org