implementing scalable tree-based algorithms using mrnet
DESCRIPTION
Implementing Scalable Tree-based Algorithms using MRNet. Ting Chen Mark Cowlishaw. Introduction. Background on MRNet Our Applications Reverse index Online queries Description of Experiments Progress Next steps. Background: MRNet [Roth, Arnold, Miller03]. Tree-Based Overlay Network - PowerPoint PPT PresentationTRANSCRIPT
1
Implementing Scalable Tree-based Algorithms
using MRNet
Ting ChenMark Cowlishaw
2
Introduction
• Background on MRNet
• Our Applications– Reverse index– Online queries
• Description of Experiments
• Progress
• Next steps
3
Background: MRNet [Roth, Arnold, Miller03]
• Tree-Based Overlay Network– Nodes of a distributed application are
arranged in a tree-structure– Leaves producing data that is aggregated and
filtered by higher levels of the tree– Separation of programs and their running tree
topology• A TBON program can run on tree-network of
different topologies
4
Reverse indexes for Keyword Queries• A keyword query is a list of words: < w1,w2, ...,wn > .• A typical reverse index is a list of words
– Each word points to a list of document IDs containing the word.
– A document ID list can either be sorted by document names or by the number of times the word appears in the document.
DocId No. of Occurences
A.htm
B.htm
C.htm
….
2
7
5
…
DocId No. of Occurences
B.htm
C.htm
….
6
4
…
Wisconsin Badgers
5
On-line queries: N-best frequency (N=2)
DocD: 0DocB: 3 DocC: 7DocA: 5
DocB: 3 DocC: 7
DocA: 5DocD: 0
DocF: 6
DocF: 6
DocE: 2
6
Objective / Experiments
• How does tree topology affect application performance– Macro-benchmarks: throughput/time-to-
completion for index building and response time for on-line queries
– Micro-benchmarks: the amount of total IO, I/O performed for each node, data transferred
– Scale-Up and Speed-Up curves with the increase of cluster nodes
– Is Multiple-level ( > 2) helpful?
7
Progress - data
• Experiment Data – All Wikipedia documents – 8GB of data– 4 Million documents
• Probably enough for a tree with 32 leaves
– Data in Wiki-text format
8
Progress – Design / Implementation
• Message Formats – Document/keyword delivery– Acknowledgment– keys / statistics– Messages for Microbenchmarks
• Shared Classes – DocumentStatistics (back end) [in test]– StatisticsEntry (all) [in test]– StatisticsList (all)
• Familiarity with MRNet Toolset– Debugging MRNet Programs
9
Next Steps
• Deploy at medium scale (~32 Nodes)
• Experiments to determine fan-out– Maximize throughput (bytes/second)– Minimize time-to-completion– Microbenchmarks
• Collected and filtered using MRNet• Idle time• Messages per second, total message traffic
10
Futures
• Replace N-best with distance from median • Add support for new document types
– XML– HTML
• Generalize to other MapReduce[Dean04] Applications– More realistic relevance ranking– Reverse hyperlink count– Collection term vectors