processing large graphs in hadoop
TRANSCRIPT
Processing LargeGraphs in Hadoop
Dani Sol
Index
The Problem
Google's Pregel
Example
Apache Giraph
The Problem
Processing graphs in MR is not practical:Most algorithms are iterative
Each iteration is mapped to a MR Job
Takes too long if many iterations are required
Writing MR for graph processing is not easy
Google's Pregel
Framework for iterative large graph processing
Inspired by Bulk Synchronous Parallel model
Computation is distributed among N+1 nodesN workers that do the actual work
1 master that synchronizes them
Takes a vertex-centric approachIs much easier to focus on the algorithm
http://kowshik.github.io/JPregel/pregel_paper.pdf
Pregel Main Concepts
Computations are a sequence of supersteps
Vertices are randomly distributed among nodes
Vertices have values and directed edges to other vertices
Vertices can send messages to other vertices
Messages sent at superstep S are received at superstep S + 1
Computation Life Cycle
Initially, all vertices are active
Inactive vertices activate again on receiving messages
In each superstep, active vertices:Receive messages from the previous superstep
Can change their value depending on their state
Can check the value of their neighbors
Can send messages to other vertices
Can vote to halt, becoming inactive
When all vertices are inactive, computation ends
Ex: Shortest Path AD
Single source shortest paths example
Want to find the shortest path from A to D
For simplicity, edges have value 1
Ex: Shortest Path AD
A: 0 B: C: D: E: Superstep 0: All vertices active, A sends messages and halts
0+1
0+1
0+1
Ex: Shortest Path AD
A: 0 B: 1C: 1D: E: 1Superstep 1: B, C, E get the messages and update their values
1+1
1+1
Ex: Shortest Path AD
A: 0 B: 1C: 1D: 2E: 1Superstep 2: E gets mssge from B, but doesn't change its value
Ex: Shortest Path AD
A: 0 B: 1C: 1D: 2E: 1Superstep 3: All vertices have halted and the computation ends
Apache Giraph
Open-source implementation of Pregel
Started by Yahoo, used by FB, LinkedIn, Twitter
Built on top Hadoop & Zookeeper:Mappers are used as nodes: N workers + 1 master
Master-worker coordination via Zookeeper
Natively reads and writes to HDFS
Natively reads and writes Writables
Can use counters, distributed cache, etc.
https://giraph.apache.org/
Apache Giraph
Pros:Integrates well with Hadoop
Has many examples included
Much better tool for processing graphs than raw MR
Cons:Documentation could be better
Still evolving: API changes in Giraph 1.1.0
Not as used as other Hadoop projects
Giraph Vertex API
public class MyVertex extends Vertex {
@Override public void compute(Iterable msgs) throws IOException { int superstep = getSuperstep(); // Current superstep setValue(val); // Modifies vertex value sendMessage(neighbor, value); // Sends message to a neighbor sendMessageToAllEdges(value); // Sends message to all neighbors }}
Vertex ID Type
Vertex Value Type
Edge Value Type
Message Value Type
Giraph Vertex API
Look at the shortest path source code:SimpleShortestPathsVertex.java (v1.0.0)
Giraph Input/Output
You can read vertex oriented (adjacency list) or edge oriented (pairs of vertices) files
Many formats already available:VertexInputFormat / VertexOutputFormat
HiveVertexInputFormat / HiveVertexOutputFormat
You can easily read any format extending VertexInputFormat / EdgeInputFormat
Thanks!