data-intensive distributed computing - roegiestdata-intensive distributed computing part 8:...

74
Data-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details CS 431/631 451/651 (Winter 2019) Adam Roegiest Kira Systems March 21, 2019 These slides are available at http://roegiest.com/bigdata-2019w/

Upload: others

Post on 23-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Data-Intensive Distributed Computing

Part 8: Analyzing Graphs, Redux (1/2)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 431/631 451/651 (Winter 2019)

Adam RoegiestKira Systems

March 21, 2019

These slides are available at http://roegiest.com/bigdata-2019w/

Page 2: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Graph Algorithms, again?(srsly?)

Page 3: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

What makes graphs hard?

Irregular structureFun with data structures!

Irregular data access patternsFun with architectures!

IterationsFun with optimizations!

Page 4: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Characteristics of Graph Algorithms

Parallel graph traversalsLocal computations

Message passing along graph edges

Iterations

Page 5: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

n0

n3n2

n1

n7

n6

n5

n4

n9

n8

Visualizing Parallel BFS

Page 6: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Given page x with inlinks t1…tn, where

C(t) is the out-degree of t

is probability of random jump

N is the total number of nodes in the graph

X

t1

t2

tn

PageRank: Defined

Page 7: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

n2 n4 n3 n5 n1 n2 n3n4 n5

n2 n4n3 n5n1 n2 n3 n4 n5

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

Map

Reduce

PageRank in MapReduce

Page 8: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Map

Reduce

PageRank BFS

PR/N d+1

sum min

PageRank vs. BFS

Page 9: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Characteristics of Graph Algorithms

Parallel graph traversalsLocal computations

Message passing along graph edges

Iterations

Page 10: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

reduce

map

HDFS

HDFS

Convergence?

BFS

Page 11: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Convergence?reduce

map

HDFS

HDFS

map

HDFS

PageRank

Page 12: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

MapReduce Sucks

Hadoop task startup time

Stragglers

Needless graph shuffling

Checkpointing at each iteration

Page 13: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

reduce

HDFS

map

HDFS

reduce

map

HDFS

reduce

map

HDFS

Let’s Spark!

Page 14: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

reduce

HDFS

map

reduce

map

reduce

map

Page 15: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

reduce

HDFS

map

reduce

map

reduce

map

Adjacency Lists PageRank Mass

Adjacency Lists PageRank Mass

Adjacency Lists PageRank Mass

Page 16: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

join

HDFS

map

join

map

join

map

Adjacency Lists PageRank Mass

Adjacency Lists PageRank Mass

Adjacency Lists PageRank Mass

Page 17: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

join

join

join

HDFS HDFS

Adjacency Lists PageRank vector

PageRank vector

flatMap

reduceByKey

PageRank vector

flatMap

reduceByKey

Page 18: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

join

join

join

HDFS HDFS

Adjacency Lists PageRank vector

PageRank vector

flatMap

reduceByKey

PageRank vector

flatMap

reduceByKey

Cache!

Page 19: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

PageRankPerformance

171

80

72

28

020406080100120140160180

30 60

Tim

eperIteration(s)

Numberofmachines

Hadoop

Spark

Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-2-amp-camp-2012-standalone-programs.pdf

MapReduce vs. Spark

Page 20: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Characteristics of Graph Algorithms

Parallel graph traversalsLocal computations

Message passing along graph edges

Iterations

Even faster?

Page 21: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Big Data Processing in a Nutshell

Partition

Replicate

Reduce cross-partition communication

Page 22: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Simple Partitioning Techniques

Hash partitioning

Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLs

Page 23: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

“Best Practices”

Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.

PageRank over webgraph(40m vertices, 1.4b edges)

How much difference does it make?

Page 24: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

+18%1.4b

674m

Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.

PageRank over webgraph(40m vertices, 1.4b edges)

How much difference does it make?

Page 25: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

+18%

-15%

1.4b

674m

Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.

PageRank over webgraph(40m vertices, 1.4b edges)

How much difference does it make?

Page 26: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

+18%

-15%

-60%

1.4b

674m

86m

Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.

PageRank over webgraph(40m vertices, 1.4b edges)

How much difference does it make?

Page 27: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Schimmy Design Pattern

Basic implementation contains two dataflows:Messages (actual computations)Graph structure (“bookkeeping”)

Schimmy: separate the two dataflows, shuffle only the messagesBasic idea: merge join between graph structure and messages

Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.

S T

both relations sorted by join key

S1 T1 S2 T2 S3 T3

both relations consistently partitioned and sorted by join key

Page 28: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

join

join

join

HDFS HDFS

Adjacency Lists PageRank vector

PageRank vector

flatMap

reduceByKey

PageRank vector

flatMap

reduceByKey

Page 29: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

+18%

-15%

-60%

1.4b

674m

86m

Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.

PageRank over webgraph(40m vertices, 1.4b edges)

How much difference does it make?

Page 30: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

+18%

-15%

-60%-69%

1.4b

674m

86m

Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.

PageRank over webgraph(40m vertices, 1.4b edges)

How much difference does it make?

Page 31: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Simple Partitioning Techniques

Hash partitioning

Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLsWeb pages: lexicographic sort of domain-reversed URLs

Social networks: sort by demographic characteristics

Page 32: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Ugander et al. (2011) The Anatomy of the Facebook Social Graph.

Analysis of 721 million active users (May 2011)

54 countries w/ >1m active users, >50% penetration

Country Structure in Facebook

Page 33: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Simple Partitioning Techniques

Hash partitioning

Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLs

Social networks: sort by demographic characteristicsWeb pages: lexicographic sort of domain-reversed URLs

Social networks: sort by demographic characteristicsGeo data: space-filling curves

Page 34: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Aside: Partitioning Geo-data

Page 35: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Geo-data = regular graph

Page 36: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Space-filling curves: Z-Order Curves

Page 37: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Space-filling curves: Hilbert Curves

Page 38: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Simple Partitioning Techniques

Hash partitioning

Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLs

Social networks: sort by demographic characteristicsGeo data: space-filling curves

But what about graphs in general?

Page 39: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Source: http://www.flickr.com/photos/fusedforces/4324320625/

Page 40: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

General-Purpose Graph Partitioning

Graph coarsening

Recursive bisection

Page 41: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Karypis and Kumar. (1998) A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.

General-Purpose Graph Partitioning

Page 42: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Karypis and Kumar. (1998) A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.

Graph Coarsening

Page 43: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Chicken-and-Egg

To coarsen the graph you need to identify dense local regions

To identify dense local regions quickly you to need traverse local edgesBut to traverse local edges efficiently you need the local structure!

To efficiently partition the graph, you need to already know what the partitions are!

Industry solution?

Page 44: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Big Data Processing in a Nutshell

Partition

Replicate

Reduce cross-partition communication

Page 45: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Partition

Page 46: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Partition

What’s the fundamental issue?

Page 47: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Characteristics of Graph Algorithms

Parallel graph traversalsLocal computations

Message passing along graph edges

Iterations

Page 48: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Partition

FastFast

Slow

Page 49: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

State-of-the-Art Distributed Graph Algorithms

Fast asynchronous iterations

Fast asynchronous iterations

Periodic synchronization

Page 50: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Source: Wikipedia (Waste container)

Graph Processing Frameworks

Page 51: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

join

join

join

HDFS HDFS

Adjacency Lists PageRank vector

PageRank vector

flatMap

reduceByKey

PageRank vector

flatMap

reduceByKey

Cache!

Page 52: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Pregel: Computational Model

Based on Bulk Synchronous Parallel (BSP)Computational units encoded in a directed graphComputation proceeds in a series of supersteps

Message passing architecture

Each vertex, at each superstep:Receives messages directed at it from previous superstep

Executes a user-defined function (modifying state)Emits messages to other vertices (for the next superstep)

Termination:A vertex can choose to deactivate itselfIs “woken up” if new messages received

Computation halts when all vertices are inactive

Page 53: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

superstep t

superstep t+1

superstep t+2

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Page 54: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Pregel: Implementation

Master-Worker architectureVertices are hash partitioned (by default) and assigned to workers

Everything happens in memory

Processing cycle:Master tells all workers to advance a single superstep

Worker delivers messages from previous superstep, executing vertex computationMessages sent asynchronously (in batches)

Worker notifies master of number of active vertices

Fault toleranceCheckpointing

Heartbeat/revert

Page 55: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

class ShortestPathVertex : public Vertex<int, int, int> {void Compute(MessageIterator* msgs) {int mindist = IsSource(vertex_id()) ? 0 : INF;for (; !msgs->Done(); msgs->Next())

mindist = min(mindist, msgs->Value());if (mindist < GetValue()) {

*MutableValue() = mindist;OutEdgeIterator iter = GetOutEdgeIterator();for (; !iter.Done(); iter.Next())SendMessageTo(iter.Target(),

mindist + iter.GetValue());}VoteToHalt();

}};

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Pregel: SSSP

Page 56: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

class PageRankVertex : public Vertex<double, void, double> {public:virtual void Compute(MessageIterator* msgs) {if (superstep() >= 1) {

double sum = 0;for (; !msgs->Done(); msgs->Next())sum += msgs->Value();

*MutableValue() = 0.15 / NumVertices() + 0.85 * sum;}

if (superstep() < 30) {const int64 n = GetOutEdgeIterator().size();SendMessageToAllNeighbors(GetValue() / n);

} else {VoteToHalt();

}}

};

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Pregel: PageRank

Page 57: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

class MinIntCombiner : public Combiner<int> {virtual void Combine(MessageIterator* msgs) {

int mindist = INF;for (; !msgs->Done(); msgs->Next())mindist = min(mindist, msgs->Value());Output("combined_source", mindist);

}

};

Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.

Pregel: Combiners

Page 58: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share
Page 59: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Giraph Architecture

Master – Application coordinatorSynchronizes supersteps

Assigns partitions to workers before superstep begins

Workers – Computation & messagingHandle I/O – reading and writing the graph

Computation/messaging of assigned partitions

ZooKeeperMaintains global application state

Page 60: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Part 0

Part 1

Part 2

Part 3

Compute / Send

Messages

Wo

rker

1

Compute / Send

Messages

Mas

ter

Wo

rker

0

In-memory graph

Send stats / iterate!

Compute/Iterate

2

Wo

rke

r 1

Wo

rker

0 Part 0

Part 1

Part 2

Part 3

Output format

Part 0

Part 1

Part 2

Part 3

Storing the graph

3

Split 0

Split 1

Split 2

Split 3

Wo

rker

1

Mas

ter

Wo

rker

0

Input format

Load / Send

Graph

Load / Send

Graph

Loading the graph

1

Split 4

Split

Giraph Dataflow

Page 61: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Active Inactive

Vote to Halt

Received Message

Vertex Lifecycle

Giraph Lifecycle

Page 62: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Output

All Vertices Halted?

Input

Compute Superstep

No

Master halted?

No

Yes

Yes

Giraph Lifecycle

Page 63: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Giraph Example

Page 64: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

5

15

2

5

5

25

5

5

5

5

1

2

Processor 1

Processor 2

Time

Execution Trace

Page 65: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

join

join

join

HDFS HDFS

Adjacency Lists PageRank vector

PageRank vector

flatMap

reduceByKey

PageRank vector

flatMap

reduceByKey

Cache!

Page 66: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

State-of-the-Art Distributed Graph Algorithms

Fast asynchronous iterations

Fast asynchronous iterations

Periodic synchronization

Page 67: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Source: Wikipedia (Waste container)

Graph Processing Frameworks

Page 68: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

GraphX: Motivation

Page 69: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

GraphX = Spark for Graphs

Integration of record-oriented and graph-oriented processing

Extends RDDs to Resilient Distributed Property Graphs

class Graph[VD, ED] {val vertices: VertexRDD[VD]val edges: EdgeRDD[ED]

}

Page 70: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Property Graph: Example

Page 71: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Underneath the Covers

Page 72: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

GraphX Operators

val vertices: VertexRDD[VD] val edges: EdgeRDD[ED] val triplets: RDD[EdgeTriplet[VD, ED]]

“collection” view

Transform vertices and edgesmapVerticesmapEdgesmapTriplets

Join vertices with external table

Aggregate messages within local neighborhood

Pregel programs

Page 73: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

join

join

join

HDFS HDFS

Adjacency Lists PageRank vector

PageRank vector

flatMap

reduceByKey

PageRank vector

flatMap

reduceByKey

Cache!

Page 74: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share

Source: Wikipedia (Japanese rock garden)