apache giraph

22
A Distributed Graph-Processing Library Ahmet Emre Aladağ - AGMLab 26.08.2013

Upload: ahmet-emre-aladag

Post on 11-Aug-2014

1.216 views

Category:

Career


1 download

DESCRIPTION

An introduction to Apache Giraph.

TRANSCRIPT

Page 1: Apache Giraph

A Distributed Graph-Processing Library

Ahmet Emre Aladağ - AGMLab26.08.2013

Page 2: Apache Giraph

● Library for large-scale graph processing.● Runs on Apache Hadoop with Map Jobs● Bulk Synchronous Parallel (BSP) model

What is Giraph?

1incoming messages

outgoing messages

0.2

0.53

0.320.16

0.12

0.34

Vertex computation

Page 3: Apache Giraph

Uses

● PageRank-variant iterative algorithms● Graph clustering

○ Label propagation○ Max Clique○ Triangle Closure○ Finding related people, groups, interests.

● Shortest-Path○ Single source, s-t, all to all

● Finding Connected Components

Page 4: Apache Giraph

Alternatives

● Map-Reduce jobs on Hadoop○ Not a good fit for graph algorithms: overhead.

● Google Pregel○ Requires its own infrastructure○ Not available○ Master is single point of failure.

● Message Passing Interface (MPI)○ Not fault-tolerant○ Too generic

Page 5: Apache Giraph

How Giraph differs

● You can use a Hadoop cluster, no need for special infrastructure.

● Easy deployment with Amazon EMR● Dynamic resource management● Graph oriented API● Open Source● Fault Tolerant, no SPOF except Hadoop

namenode and jobtracker● Jython Support

Page 6: Apache Giraph

Layers

Page 7: Apache Giraph

Mechanism

InputFormat/Reader

Input

Computation OutputFormat/Writer

Output

● Accumulo● HBase● HCatalog● HDFS● Hive● Neo4j etc.

● Accumulo● HBase● HCatalog● HDFS● Hive● Neo4j etc.● GraphViz

Adjacency matrix, id-value pairs, JSON

Page 8: Apache Giraph

InputFormat

● VertexInputFormat1;3.42;6.13;2.7

● EdgeInputFormat1;22;31;3

1 2 3

3.4 6.1 2.7

1 2 3

Page 9: Apache Giraph

Computation● Superstep barriers.● Send/Receive messages from neighbors● Update value.● Vote to halt or wake up.

Single-Source Shortest Path Example

Page 10: Apache Giraph

Shortest-Path Computation Code

Note: old API

Page 11: Apache Giraph

Ex: Finding the maximum value

Page 12: Apache Giraph

Aggregators

● Shared variables among the workers.● Each vertex computation can add/multiply a

value to aggregators. ● Examples:

○ Holding the min/max value among all vertices○ Holding sum of the vertex values.○ Holding average value of vertex values.○ Holding sum of mean square errors and stdev.

1 2 30.2 0.6

0.45

1.25

Computation at Iteration k

Page 13: Apache Giraph

MasterCompute Class

● Master’s compute() always runs before the slaves (like pre-superstep)

○ In compute: aggregate vertex values: sum of values○ In MasterCompute: average=sum/N

● Aggregators are registered here.

● You can set values to aggregators.

Page 14: Apache Giraph

Worker Context

● Allows for the execution of user code on a per-worker basis.

● There's one WorkerContext per worker.

● Methods for Pre/post superstep/application operations.

Page 15: Apache Giraph

Flexible Edge/Vertex Input

● Read edges/vertices from different sources.● Multiple input resources

Page 16: Apache Giraph

Parallel Computing

● More map jobs (workers) = parallel computing

● To overcome slowest worker problem, multithreading is applied on input/computation/output

● Linear speedup in CPU-bound applications such as k-means clustering due to multithreading

● Take a set of entrie machines & use multithreading to maximize resource utilization.

Page 17: Apache Giraph

Memory Optimization

● Vertices and edges are stored as serialized byte arrays.

● Used FastUtil-based Java primitives.

Page 18: Apache Giraph

Sharded Aggregators

● Each aggregator is randomly assigned to one of the workers.

● The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to other workers.

● Aggregation responsibilities are balanced across all workers rather than bottlenecked by the master.

Page 19: Apache Giraph

Performance● PageRank on 1 trillion edges with 200 commodity

machines: 4 minutes/iteration.● K-Means on 1 billion input vectors x 100 features into

10.000 centroids: 10 minutes.● Linear Scalability

Page 20: Apache Giraph

Currently

● Version 1.0, on the way to 1.1● Changing rapidly: backwards-incompatible

changes● Documentation not mature yet.● More algorithms to be contributed.● More data sources to be ported.● http://giraph.apache.org for more info

Page 21: Apache Giraph

References

Giraph: Large-scale graph processing infrastructure on Hadoop, 2011

Scaling Apache Giraph to a trillion edges, Avery Ching, Facebook, 2013

Scaling Apache Giraph, Nitay Joffe, Facebook, 2013.

Giraph: http://giraph.apache.org

Page 22: Apache Giraph

Questions

?