TRANSCRIPT
Comparing Distributed Indexing: To MapReduce or Not?
Richard McCreadie
Craig Macdonald
Iadh Ounis
Talk Outline
1. Motivations
2. Classical Indexing
   – Single-Pass Indexing
   – Shared-Nothing / Shared-Corpus Indexing
3. MapReduce
   – What is MapReduce?
   – Indexing Strategies for MapReduce
4. Experimentation and Results
   – Measures and Environment
   – Using Shared-Corpus Indexing as a baseline
   – Comparing MapReduce Indexing Techniques
5. Conclusions
MOTIVATIONS
1. Why is Efficient Indexing Important?
2. Contributions
Why is Efficient Indexing Important?
• Indexing is an essential part of any IR system
• But test corpora have grown exponentially
• This has reinvigorated the need for efficient indexing
Collection   Data    Year   Docs   Size (GB)
WT2G         Web     1999   240k   2.0
GOV          Web     2002   1.8M   18.0
Blogs06      Blogs   2006   3M     13.0
GOV2         Web     2004   25M    425.0
ClueWeb09    Web     2009   1.2B   25,000
• Commercial organisations and research groups alike have embraced scale-out approaches
• MapReduce has been widely adopted by the commercial search engines
• However, there have been no studies into its suitability for indexing
Solutions?
MapReduce
Contributions
• We examine the benefits to be gained from indexing with MapReduce
• We discuss 4 indexing techniques in MapReduce
  – 3 existing techniques
  – 1 novel technique inspired by single-pass indexing
• We then compare them to a shared-corpus distributed indexing strategy
CLASSICAL INDEXING
1. Classical Indexing
2. Single-Pass Indexing
3. Shared-Nothing & Shared-Corpus Distributed Indexing
Classical Indexing
[Diagram: a lexicon entry (term, total docs, total frequency, pointer) pointing into a posting list of (document number, frequency) pairs]
Need to build two important structures:
• Inverted Index: posting lists for each term: <docid, frequency>
• Lexicon: term information and pointer to correct posting list
• State-of-the-art indexing uses a single-pass strategy.
(I.H. Witten, A. Moffat and T.C. Bell, 1999)
1. Parse collection
2. Build postings in memory
3. Merge inverted indices
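The two structures above can be sketched in a few lines of Python. This is a toy in-memory illustration, not the Terrier implementation; the "pointer" is just the term key, where on disk it would be a byte offset into the posting-list file:

```python
from collections import Counter

def build_index(docs):
    """Build a toy inverted index and lexicon from {docid: text}.

    Inverted index: term -> posting list of (docid, frequency) pairs.
    Lexicon: term -> (document frequency, total term frequency, pointer).
    """
    inverted = {}
    for docid, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            inverted.setdefault(term, []).append((docid, tf))
    lexicon = {t: (len(p), sum(tf for _, tf in p), t)
               for t, p in inverted.items()}
    return inverted, lexicon

docs = {1: "to be or not to be", 2: "to index or not to index"}
inverted, lexicon = build_index(docs)
```

For example, `inverted["to"]` is `[(1, 2), (2, 2)]`: the term occurs twice in each document, and its lexicon entry records 2 documents and a total frequency of 4.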
Single-Pass In-Memory Indexing
[Diagram: per-term posting lists (t1, t2, t3) accumulate in RAM; as memory fills, the indexer flushes them to disk as compressed files, which are merged into the final inverted index]
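The flush-and-merge cycle above can be sketched as follows. This is a minimal model, not the Witten, Moffat & Bell implementation: the memory budget is simulated by a posting count, and the compressed on-disk runs are just sorted in-memory lists:

```python
from collections import Counter
import heapq

def single_pass_index(docs, max_postings=4):
    """Single-pass in-memory indexing over (docid, text) pairs:
    accumulate postings in RAM, flush a sorted run when the (simulated)
    memory budget is hit, and merge all runs at the end."""
    runs, current, n = [], {}, 0
    for docid, text in docs:
        for term, tf in Counter(text.split()).items():
            current.setdefault(term, []).append((docid, tf))
            n += 1
        if n >= max_postings:            # memory nearly full: flush a run
            runs.append(sorted(current.items()))
            current, n = {}, 0
    if current:
        runs.append(sorted(current.items()))
    # Merge the sorted runs into the final inverted index.
    final = {}
    for term, postings in heapq.merge(*runs):
        final.setdefault(term, []).extend(postings)
    return final
```

Because each run is written in term order, the final merge is a cheap k-way merge rather than a full re-sort.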
How can Indexing be Distributed?
• Two classical approaches
– Shared-Nothing
  • Indexer on every machine
  • Index local data
  • Optimal!
– Shared-Corpus
  • Indexer on every machine
  • Index remote data over NFS
Problems with Classical Approaches
• Shared-Nothing
  – No fault tolerance
  – Only a single copy of data
  – Jobs have to be manually administered
  – Data may not be available locally
• Shared-Corpus
  – No fault tolerance
  – Single point of failure (server)
  – Constrained by network bandwidth and server speed
  – Jobs have to be manually administered
• MapReduce reportedly solves these issues
MAPREDUCE INDEXING
1. MapReduce
2. MapReduce Indexing Strategies
MapReduce
• Programming paradigm introduced by Google
• Splits jobs into map and reduce operations
1. Map: apply the map function (here, indexing) over each entry in the input
2. Sort: sort the map output locally
3. Reduce: merge the map output
• Provides:
  – Convenient programming API
  – Automatic job control
  – Fault tolerance
  – Distributed data storage (DFS)
  – Data replication
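The three-step data flow above can be modelled in a few lines of Python. This is an in-memory illustration only; a real framework wraps exactly this skeleton with the DFS, scheduling, and fault tolerance:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    """Toy MapReduce: map over every record, sort the intermediate
    (key, value) pairs by key, then reduce each group of values."""
    intermediate = [pair for record in inputs for pair in map_fn(record)]
    intermediate.sort(key=itemgetter(0))              # the "sort" phase
    return {k: reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=itemgetter(0))}
```

The canonical usage example is word counting: the map emits `(word, 1)` pairs and the reduce sums them.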
Indexing with MapReduce
• Implies that indexing is trivial . . .
• Could a more complex approach perform better?
• We try multiple approaches:
“Map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document ID’s and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.”
Dean & Ghemawat, MapReduce: Simplified data processing on large clusters. OSDI 2004
Approach      Emits           Sorting   Num emits per map   Emit size
D&G_Token     Tokens          Lots      Lots                Tiny
D&G_Term      Terms           Lots      Many                Tiny
Nutch         Documents       Little    Some                Average
Single-Pass   Posting lists   Some      Few                 Large
Emitting each word in the document means a big intermediate sort and lots of merging
D&G_Token & D&G_Term
• D&G_Token
  – Map: for each token in the document,
        emit (token, document-ID)
  – Sort: by token and document-ID
  – Reduce: for each unique token (term),
        sum repeated document-IDs to get tf values
        write the posting list for that term
• D&G_Term
  – Map: for each term in the document,
        emit (term, document-ID, tf)
  – Sort: by term and document-ID
  – Reduce: for each term,
        write the posting list for that term
• Based on Dean & Ghemawat’s MapReduce Paper (OSDI 2004)
Emit every token
Emit every term
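The two strategies above can be sketched side by side. These are in-memory simulations of the map/sort/reduce pipeline, not Hadoop code; note how D&G_Term's per-document pre-aggregation in the map shrinks the intermediate data that must be sorted:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def dg_token(docs):
    """D&G_Token: emit one (token, docid) pair per token occurrence."""
    pairs = [(tok, docid) for docid, text in docs for tok in text.split()]
    pairs.sort()                           # sort by token, then docid
    index = {}
    for term, group in groupby(pairs, key=itemgetter(0)):
        # Reduce: count repeated docids to recover term frequencies.
        index[term] = sorted(Counter(d for _, d in group).items())
    return index

def dg_term(docs):
    """D&G_Term: emit one (term, docid, tf) triple per unique term."""
    pairs = [(term, docid, tf) for docid, text in docs
             for term, tf in Counter(text.split()).items()]
    pairs.sort()                           # far fewer pairs to sort
    return {term: [(d, tf) for _, d, tf in group]
            for term, group in groupby(pairs, key=itemgetter(0))}
```

Both produce identical posting lists; the difference is purely in how much intermediate data is emitted and sorted.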
Nutch Style Indexing
• Nutch (Lucene) provides MapReduce indexing
• We investigated v0.9
– Map: for each document,
      analyse the document
      emit (document-ID, analysed-document)
– Sort: by document-ID (URL)
– Reduce: for each document-ID,
      build the Nutch document
      index the Nutch document
• Approximately equivalent to using a null mapper
• Emits less than the D&G approaches
• But we believe we can do better . . .
Emit every document
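The document-level pipeline above can be sketched as follows. This is a toy simulation, not Nutch/Lucene code: "analysis" is reduced to tokenisation, and the reduce phase does all the index construction:

```python
from collections import Counter

def nutch_style(docs):
    """Nutch-style indexing: the map only analyses each document and
    emits it whole, keyed by docid; indexing happens in the reduce."""
    # Map: analyse (tokenise) each document, emit (docid, analysed-doc).
    analysed = [(docid, text.split()) for docid, text in docs]
    analysed.sort(key=lambda p: p[0])      # Sort: by document ID
    # Reduce: index each analysed document into the inverted index.
    index = {}
    for docid, tokens in analysed:
        for term, tf in Counter(tokens).items():
            index.setdefault(term, []).append((docid, tf))
    return index
```

One emit per document keeps the intermediate sort small, but each emit carries the whole analysed document.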
Our Single-Pass Indexing Strategy
• We propose a novel adaptation of Single-Pass indexing
• Idea:
  – Use local machine memory
  – Build useful compressed structures
  – Emit less, therefore less IO and less sorting
• Single-Pass Indexing
  – Map: for each document,
        add the document to a compressed in-memory partial index
        if memory is nearly full, flush:
            for each term in the partial index,
                emit (term, partial posting list)
Emit limited Posting-Lists
Our MapReduce Indexing Strategy (2)
– Sort: by map, flush, and term
– Reduce: for each term,
      merge the partial posting lists
      write out the merged posting list
• Maps only emit compressed posting lists (less IO)
• Few emits, so sorting is easy
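The map/flush/reduce flow just described can be sketched as follows. This is a simplified in-memory model, not the Terrier implementation: compression is omitted, the memory budget is simulated by a document count, and the flush-order sort is provided by Python's stable sort:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

def single_pass_mr(docs, flush_every=2):
    """Single-pass MapReduce indexing: the map accumulates a partial
    index in memory and, on flush, emits one (term, partial posting
    list) pair per term; the reduce concatenates the runs per term."""
    emitted = []                 # (term, partial posting list) pairs
    partial, seen = {}, 0
    for docid, text in docs:
        for term, tf in Counter(text.split()).items():
            partial.setdefault(term, []).append((docid, tf))
        seen += 1
        if seen % flush_every == 0:       # memory nearly full: flush
            emitted += list(partial.items())
            partial = {}
    emitted += list(partial.items())      # final flush
    # Sort by term; the stable sort keeps flush order within a term.
    emitted.sort(key=itemgetter(0))
    # Reduce: merge the partial posting lists for each term.
    return {term: [p for _, plist in group for p in plist]
            for term, group in groupby(emitted, key=itemgetter(0))}
```

Each emitted value is a whole posting-list fragment, so there are far fewer, larger records to sort than one pair per token or term.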
• We now evaluate these indexing strategies . . .
EXPERIMENTATION & RESULTS
1. Measures and Setup
2. Indexing Throughput and Scaling
Research Questions
• Recall, we want to evaluate MapReduce for indexing large-scale collections
• Questions:
  – Is Shared-Corpus indexing sufficient for large-scale collections? (baseline)
  – Can MapReduce perform close to Shared-Nothing indexing? (optimal)
Evaluation of MapReduce Indexing
• Two measures are used
• Measure speed with throughput:

    throughput = (compressed) collection size / total time taken to index

• Measure scaling with speedup:

    speedup = total time taken to index on 1 machine / total time taken to index using m machines

• Optimal speedup would be where m machines are m times faster – known as linear speedup
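The two measures translate directly into code; the values used below are illustrative, not results from the talk:

```python
def throughput(collection_mb, total_seconds):
    """Indexing speed: (compressed) collection size over total time."""
    return collection_mb / total_seconds          # MB/sec

def speedup(time_one_machine, time_m_machines):
    """Scaling: time on 1 machine over time on m machines.
    A value of m on m machines is linear (optimal) speedup."""
    return time_one_machine / time_m_machines
```

For example, a job that takes 24 hours on one machine and 3 hours on 8 machines achieves a speedup of 8, i.e. exactly linear.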
Experimental Setup
• All indexing strategies are implemented in Hadoop
  – The only fully open-source MapReduce implementation
  – An Apache Software Foundation project
  – Reportedly used for indexing by a major web search engine
• Cluster Setup
  – 4 (3) cores, 2.4GHz, 4GB RAM, ~1TB of HDD per machine
  – Single gigabit Ethernet rack
  – Hadoop v0.18.2 (HDFS)
  – Hadoop on Demand (HOD)
  – Torque Resource Manager (v2.1.9)
  – Also a RAID5 file server with 8 3GHz cores
• Use the TREC .GOV2 corpus
  – 25 million documents, 425GB uncompressed
Target Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass

Table 1: Throughput (MB/sec) when indexing .GOV2 with m machines
• Terrier single-pass indexing has a throughput of 1 MB/sec for a single core
• We project Shared-Nothing as being optimal (linear speedup)
Baseline Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token
MapReduce D&G_Term
MapReduce Single-Pass
• Use shared-corpus indexing as a baseline
• No improvement after 4 machines
• The file server acts as a bottleneck
• Can't scale file servers indefinitely
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
D&G_Token Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token              -      -      -      -      -
MapReduce D&G_Term
MapReduce Single-Pass
• No results
• Runs failed: too much map output caused the disks to run out of space
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
D&G_Term Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token              -      -      -      -      -
MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
MapReduce Single-Pass
• Indexing was possible
• But scaling was strongly sub-linear
• Still too much emitting
• Half the speed of the Shared-Corpus approach
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
Single-Pass Indexing Throughput
                              Number of Machines Allocated
Indexing Strategy                1      2      4      6      8
Shared-Nothing Distributed       3      6     12     18     24
Shared-Corpus Distributed     2.44    4.6   12.8   12.4   12.8
MapReduce D&G_Token              -      -      -      -      -
MapReduce D&G_Term            1.15   1.59   4.01   4.71   6.38
MapReduce Single-Pass         2.59   5.19   9.45  13.16  17.31
• Faster than both the D&G_Term and Shared-Corpus approaches
• Still scales sub-linearly
• Data locality is the main concern
Table 1 : Throughput (MB/sec) when indexing .GOV2 with m machines
CONCLUSION
Conclusions
• Shared-Corpus indexing is limited by the speed of the central file store
• Performing efficient indexing in MapReduce is not trivial
  – Indeed, all indexing strategies implemented here scale sub-linearly
• Single-pass indexing is only marginally sub-linear
  – A small price to pay for the advantages?
• Using single-pass indexing, MapReduce is suitable for indexing the current generation of web corpora
  – We indexed the 'B' set of ClueWeb09 in just under 20 hours using 5 machines (15 cores)
  – It has been reported on the TREC Web mailing list that, using Indri with Amazon EC2 (4 cores) and S3, indexing takes 31 hours with a Shared-Nothing approach
    . . . plus 10 days of upload time!
Questions?
http://terrier.org