imapreduce : a distributed computing framework for iterative computation

22
iMapReduce: A Distributed Computing Framework for Iterative Computation Yanfeng Zhang, Northeastern University, China Qixin Gao, Northeastern University, China Lixin Gao, UMass Amherst Cuirong Wang, Northeastern University, China

Upload: tameka

Post on 24-Feb-2016

45 views

Category:

Documents


0 download

DESCRIPTION

iMapReduce : A Distributed Computing Framework for Iterative Computation. Yanfeng Zhang, Northeastern University, China Qixin Gao , Northeastern University, China Lixin Gao , UMass Amherst Cuirong Wang, Northeastern University, China. Iterative Computation Everywhere. Video suggestion. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: iMapReduce :  A Distributed Computing Framework for Iterative Computation

iMapReduce: A Distributed Computing Framework for

Iterative Computation

Yanfeng Zhang, Northeastern University, ChinaQixin Gao, Northeastern University, China

Lixin Gao, UMass AmherstCuirong Wang, Northeastern University, China

Page 2: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Iterative Computation Everywhere

PageRank

Clustering

BFSVideo suggestion

Pattern Recognition

Page 3: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Iterative Computation on Massive Data

PageRank

Clustering

BFSVideo suggestion

Pattern Recognition

100 million videos

700PB human genomics

29.7 billion pages

500TB/yr hospital

Google maps 70.5TB

Page 4: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Large-Scale Solution MapReduce

• Programming model– Map: distribute computation workload– Reduce: aggregate the partial results

• But, no support of iterative processing– Batched processing– Must design a series of jobs, do the same

thing many times• Task reassignment, data loading, data dumping

Page 5: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Outline

• Introduction• Example: PageRank in MapReduce• System Design: iMapReduce• Evaluation• Conclusions

Page 6: iMapReduce :  A Distributed Computing Framework for Iterative Computation

PageRank• Each node associated with a ranking score R

– Evenly to its neighbors with portion d– Retain portion (1-d)

Page 7: iMapReduce :  A Distributed Computing Framework for Iterative Computation

PageRank in MapReduce

0 1:2,31 1:32 1:0,3

3 1:44 1:1,2,3

map

map

2 1/23 1/2

reduce

reduce4 1 3 11 1/3

2 1/3

3 1/3

0 1/2

3 1/2

1 0.333 2.33

0 0.52 0.834 1

0 p:2,31 p:32 p:0,3

3 p:44 p:1,2,3

0 0.5:2,32 0.83:0,34 1:1,2,3

1 0.33:33 2.33:4

for the next MapReduce job

Page 8: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Problems of Iterative Implementation

• 1. Multiple times job/task initializations

• 2. Re-shuffling the static data

• 3. Synchronization barrier between jobs

Page 9: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Outline

• Introduction• Example: PageRank• System Design: iMapReduce• Evaluation• Conclusions

Page 10: iMapReduce :  A Distributed Computing Framework for Iterative Computation

iMapReduce Solutions

• 1. Multiple job/task initializations– Iterative processing

• 2. Re-shuffling the static data– Maintain static data locally

• 3. Synchronization barrier between jobs– Asynchronous map execution

Page 11: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Iterative Processing

• Reduce Map– Load/dump once– Persistent map/reduce

task– Each reduce task has a

correspondent map task– Locally connected

Page 12: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Maintain Static Data Locally

• Iterate state data– Update iteratively

• Maintain static data– Join with state data at

map phase

Page 13: iMapReduce :  A Distributed Computing Framework for Iterative Computation

State Data Static Data

0 12 14 1

1 13 1

map 0

map 1

2 1/23 1/2

reduce 0

reduce 14 1 3 11 1/3

2 1/3

3 1/3

0 1/2

3 1/2

1 0.333 2.33

0 0.52 0.834 1

for the next iteration

0 2,32 0,34 1,2,3

1 33 4

Page 14: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Asynchronous Map Execution

• Persistent socket reduce -> map• Reduce completion directly trigger map• Map tasks no need to wait

MapReduce iMapReduce

time

Page 15: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Outline

• Introduction• Example: PageRank• System Design: iMapReduce• Evaluation• Conclusions

Page 16: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Experiment Setup • Cluster

– Local cluster: 4 nodes– Amazon EC2 cluster: 80 small instances

• Implemented algorithms– Shortest path, PageRank, K-Means

• Data sets

SSSP PageRank

Page 17: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Results on Local Cluster

PageRank: google webgraph

iMapReduce

Hadoop

Sync. iMapReduce

Hadoop ex. initialization

Page 18: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Results cont.

PageRank: google webgraph PageRank: Berk-Stan webgraph

SSSP: Facebook datasetSSSP: DBLP dataset

Page 19: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Scale to Large Data Sets

• On Amazon EC2 cluster• Various-size graphs

PageRankSSSP

Page 20: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Different Factors on Running Time Reduction

iMapReduce

Shuffling

Sync. mapsInit time

Hadoop

Page 21: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Conclusion

• iMapReduce reduces running time by– Reducing the overhead of job/task

initialization– Eliminating the shuffling of static data– Allowing asynchronous map execution

• Achieve a factor of 1.2x to 5x speedup• Suitable for most iterative computations

Page 22: iMapReduce :  A Distributed Computing Framework for Iterative Computation

Q & A

• Thank you!