Implementing Scalable Tree-based Algorithms using MRNet


Post on 19-Jan-2016


Page 1: Implementing Scalable  Tree-based Algorithms  using MRNet

Implementing Scalable Tree-based Algorithms using MRNet

Ting Chen, Mark Cowlishaw

Page 2: Implementing Scalable  Tree-based Algorithms  using MRNet

Introduction

• Background on MRNet
• Our Applications
  – Reverse index
  – Online queries
• Description of Experiments
• Progress
• Next steps

Page 3: Implementing Scalable  Tree-based Algorithms  using MRNet

Background: MRNet [Roth, Arnold, Miller03]

• Tree-Based Overlay Network (TBON)
  – Nodes of a distributed application are arranged in a tree structure
  – Leaves produce data that is aggregated and filtered by higher levels of the tree
  – Programs are separated from the tree topology they run on
    • A TBON program can run on tree networks of different topologies
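The separation of program and topology can be illustrated with a minimal sketch in plain Python. It does not use the MRNet API; the `Node` class, the `reduce_up` function, and the leaf values are all hypothetical names and data chosen for illustration. The same filter (here `sum`) runs unchanged on a flat tree and on a two-level tree.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A tree node; a node without children is a leaf (hypothetical model)."""
    value: int = 0                      # data produced locally (leaves only)
    children: List["Node"] = field(default_factory=list)


def reduce_up(node: Node, filt: Callable[[List[int]], int]) -> int:
    """Send leaf data up the tree, applying `filt` at every internal node."""
    if not node.children:
        return node.value
    return filt([reduce_up(child, filt) for child in node.children])


# Topology 1: a flat tree, one root with four leaves.
flat = Node(children=[Node(v) for v in (1, 2, 3, 4)])

# Topology 2: a deeper tree, two internal nodes with two leaves each.
deep = Node(children=[
    Node(children=[Node(1), Node(2)]),
    Node(children=[Node(3), Node(4)]),
])

flat_total = reduce_up(flat, sum)
deep_total = reduce_up(deep, sum)
```

Both calls give the same total because the filter is associative; only the shape of the tree, and therefore where the work happens, differs.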

Page 4: Implementing Scalable  Tree-based Algorithms  using MRNet

Reverse indexes for Keyword Queries

• A keyword query is a list of words: <w1, w2, ..., wn>.
• A typical reverse index is a list of words
  – Each word points to a list of the IDs of the documents containing the word.
  – A document ID list can be sorted either by document name or by the number of times the word appears in the document.

Wisconsin
DocId    No. of Occurrences
A.htm    2
B.htm    7
C.htm    5
...

Badgers
DocId    No. of Occurrences
B.htm    6
C.htm    4
...
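A reverse index of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_reverse_index` and `postings`, the tokenization rule, and the sample document texts are all made up here and are not taken from the slides.

```python
import re
from collections import Counter, defaultdict


def build_reverse_index(docs):
    """Map each word to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"\w+", text.lower())   # assumed tokenization
        for word, count in Counter(words).items():
            index[word][doc_id] = count
    return index


def postings(index, word, by="name"):
    """Return the (doc_id, count) list for `word`, sorted by name or by count."""
    entries = index.get(word.lower(), {}).items()
    if by == "count":
        # Highest count first; ties broken by document name.
        return sorted(entries, key=lambda e: (-e[1], e[0]))
    return sorted(entries)


# Illustrative documents only; not the data from the slide.
docs = {
    "A.htm": "Wisconsin lakes and Wisconsin winters",
    "B.htm": "Wisconsin Badgers beat the other badgers",
    "C.htm": "Badgers",
}
index = build_reverse_index(docs)
```

With this index, `postings(index, "wisconsin")` lists documents by name, and `postings(index, "badgers", by="count")` lists them by descending occurrence count, matching the two orderings described above.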

Page 5: Implementing Scalable  Tree-based Algorithms  using MRNet

On-line queries: N-best frequency (N=2)

[Figure: tree diagram of an N-best (N=2) query. Per-document frequencies shown at the nodes: DocA: 5, DocB: 3, DocC: 7, DocD: 0, DocE: 2, DocF: 6.]
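The N-best filter can be sketched as follows: every node keeps only its N highest-frequency entries before passing them upward. This is a plain-Python model, not MRNet code. The tree layout below (which documents sit at which leaf) is an assumption made for illustration, as are the names `n_best` and `query`; only the document frequencies come from the slide. The sketch also assumes each document is stored at exactly one leaf, so no counts need to be summed across nodes.

```python
import heapq


def n_best(entries, n):
    """Keep the n (doc_id, frequency) pairs with the highest frequency."""
    return heapq.nlargest(n, entries, key=lambda e: e[1])


def query(tree, n):
    """Leaves are dicts {doc_id: frequency}; internal nodes are lists of subtrees."""
    if isinstance(tree, dict):
        return n_best(tree.items(), n)
    merged = [entry for child in tree for entry in query(child, n)]
    return n_best(merged, n)


# Assumed layout: two internal nodes, each with two leaves.
tree = [
    [{"DocB": 3, "DocC": 7}, {"DocA": 5, "DocD": 0}],
    [{"DocF": 6}, {"DocE": 2}],
]
top_two = query(tree, 2)
```

Because every node forwards at most N entries, the data sent over each link is bounded by N regardless of how many documents sit below it.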

Page 6: Implementing Scalable  Tree-based Algorithms  using MRNet

Objective / Experiments

• How does tree topology affect application performance?
  – Macro-benchmarks: throughput/time-to-completion for index building and response time for on-line queries
  – Micro-benchmarks: total amount of I/O, I/O performed at each node, data transferred
  – Scale-up and speed-up curves as the number of cluster nodes increases
  – Are multiple levels (> 2) helpful?
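For reference, the two curves can be computed from measured run times as sketched below, using the usual definitions: speed-up holds the problem size fixed while nodes are added, and scale-up grows the problem size in proportion to the node count. The function names are hypothetical, and no timings from the experiments are implied.

```python
def speed_up(time_on_one_node, time_on_n_nodes):
    """Fixed problem size: how many times faster the n-node run is."""
    return time_on_one_node / time_on_n_nodes


def scale_up(time_small_problem_small_system, time_large_problem_large_system):
    """Problem grows with the system: 1.0 means perfect scale-up."""
    return time_small_problem_small_system / time_large_problem_large_system
```

As a usage example with placeholder numbers only, `speed_up(120.0, 20.0)` gives 6.0, and `scale_up(100.0, 125.0)` gives 0.8, meaning the larger configuration took 25% longer than ideal.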

Page 7: Implementing Scalable  Tree-based Algorithms  using MRNet

Progress - data

• Experiment Data
  – All Wikipedia documents
  – 8 GB of data
  – 4 million documents
    • Probably enough for a tree with 32 leaves
  – Data in Wiki-text format

Page 8: Implementing Scalable  Tree-based Algorithms  using MRNet

Progress – Design / Implementation

• Message Formats
  – Document/keyword delivery
  – Acknowledgment
  – Keys / statistics
  – Messages for microbenchmarks
• Shared Classes
  – DocumentStatistics (back end) [in test]
  – StatisticsEntry (all) [in test]
  – StatisticsList (all)
• Familiarity with MRNet Toolset
  – Debugging MRNet programs
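One way the three shared classes might fit together is sketched below. Only the class names come from the slide; every field and method (`doc_id`, `count`, `merge`, `best`, `entry_for`) is a guess made for illustration, and the actual implementation, which targets MRNet, may look quite different.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StatisticsEntry:
    """One (document, count) pair; assumed to be used by all nodes."""
    doc_id: str
    count: int


@dataclass
class StatisticsList:
    """A list of entries that can be merged on the way up the tree (assumed)."""
    entries: List[StatisticsEntry] = field(default_factory=list)

    def merge(self, other: "StatisticsList") -> "StatisticsList":
        return StatisticsList(self.entries + other.entries)

    def best(self, n: int) -> List[StatisticsEntry]:
        return sorted(self.entries, key=lambda e: e.count, reverse=True)[:n]


class DocumentStatistics:
    """Back-end helper (assumed): counts word occurrences in one document."""

    def __init__(self, doc_id: str, text: str):
        self.doc_id = doc_id
        self.counts = Counter(text.lower().split())

    def entry_for(self, word: str) -> StatisticsEntry:
        return StatisticsEntry(self.doc_id, self.counts[word.lower()])
```

In this sketch a back end would build one `DocumentStatistics` per document, wrap the resulting entries in a `StatisticsList`, and send that list toward the root, where lists from different children are merged.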

Page 9: Implementing Scalable  Tree-based Algorithms  using MRNet

Next Steps

• Deploy at medium scale (~32 nodes)
• Experiments to determine fan-out
  – Maximize throughput (bytes/second)
  – Minimize time-to-completion
  – Microbenchmarks
    • Collected and filtered using MRNet
    • Idle time
    • Messages per second, total message traffic
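The trade-off behind the fan-out experiments can be made concrete with a small calculation: for a fixed number of leaves, a larger fan-out gives a shallower tree with fewer internal nodes, while a smaller fan-out gives more levels of aggregation. The sketch below assumes a balanced tree; the function name `tree_shape` is hypothetical.

```python
def tree_shape(leaves, fan_out):
    """Return (levels of internal nodes, number of internal nodes) for a
    balanced tree over `leaves` leaves with at most `fan_out` children per node."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    levels, internal, width = 0, 0, leaves
    while width > 1:
        width = -(-width // fan_out)   # ceiling division: parents needed
        internal += width
        levels += 1
    return levels, internal


# Shapes for 32 leaves at a few candidate fan-outs.
shapes = {k: tree_shape(32, k) for k in (2, 4, 8, 32)}
```

For 32 leaves this gives 5 levels and 31 internal nodes at fan-out 2, but a single root at fan-out 32, which is the range the experiments would need to explore.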

Page 10: Implementing Scalable  Tree-based Algorithms  using MRNet

Futures

• Replace N-best with distance from median
• Add support for new document types
  – XML
  – HTML
• Generalize to other MapReduce [Dean04] applications
  – More realistic relevance ranking
  – Reverse hyperlink count
  – Collection term vectors
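As an illustration of how one of these applications might map onto the same pattern, here is a minimal sketch of reverse hyperlink counting in map/reduce style. The function names and the sample link data are hypothetical; this is plain Python, not MRNet or MapReduce library code.

```python
from collections import defaultdict


def map_links(source, targets):
    """Map step: emit (target, 1) for every link found in page `source`."""
    return [(target, 1) for target in targets]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Illustrative link graph: page -> pages it links to.
pages = {"A.htm": ["B.htm", "C.htm"], "B.htm": ["C.htm"], "C.htm": []}
emitted = [pair for src, tgts in pages.items() for pair in map_links(src, tgts)]
inbound = reduce_counts(emitted)
```

Since summation is associative, the reduce step could run at every internal node of the tree, with each node forwarding partial counts to its parent.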