processing large graphs in hadoop

Processing LargeGraphs in Hadoop

Dani Sol

Index

The Problem

Google's Pregel

Example

Apache Giraph

The Problem

Processing graphs in MR is not practical:Most algorithms are iterative

Each iteration is mapped to a MR Job

Takes too long if many iterations are required

Writing MR for graph processing is not easy

Google's Pregel

Framework for iterative large graph processing

Inspired by Bulk Synchronous Parallel model

Computation is distributed among N+1 nodesN workers that do the actual work

1 master that synchronizes them

Takes a vertex-centric approachIs much easier to focus on the algorithm

http://kowshik.github.io/JPregel/pregel_paper.pdf

Pregel Main Concepts

Computations are a sequence of supersteps

Vertices are randomly distributed among nodes

Vertices have values and directed edges to other vertices

Vertices can send messages to other vertices

Messages sent at superstep S are received at superstep S + 1

Computation Life Cycle

Initially, all vertices are active

Inactive vertices activate again on receiving messages

In each superstep, active vertices:Receive messages from the previous superstep

Can change their value depending on their state

Can check the value of their neighbors

Can send messages to other vertices

Can vote to halt, becoming inactive

When all vertices are inactive, computation ends

Ex: Shortest Path AD

Single source shortest paths example

Want to find the shortest path from A to D

For simplicity, edges have value 1


A: 0 B: C: D: E: Superstep 0: All vertices active, A sends messages and halts

0+1

0+1

0+1


A: 0 B: 1C: 1D: E: 1Superstep 1: B, C, E get the messages and update their values

1+1

1+1


A: 0 B: 1C: 1D: 2E: 1Superstep 2: E gets mssge from B, but doesn't change its value


A: 0 B: 1C: 1D: 2E: 1Superstep 3: All vertices have halted and the computation ends

Apache Giraph

Open-source implementation of Pregel

Started by Yahoo, used by FB, LinkedIn, Twitter

Built on top Hadoop & Zookeeper:Mappers are used as nodes: N workers + 1 master

Master-worker coordination via Zookeeper

Natively reads and writes to HDFS

Natively reads and writes Writables

Can use counters, distributed cache, etc.

https://giraph.apache.org/

Apache Giraph

Pros:Integrates well with Hadoop

Has many examples included

Much better tool for processing graphs than raw MR

Cons:Documentation could be better

Still evolving: API changes in Giraph 1.1.0

Not as used as other Hadoop projects

Giraph Vertex API

public class MyVertex extends Vertex {

@Override public void compute(Iterable msgs) throws IOException { int superstep = getSuperstep(); // Current superstep setValue(val); // Modifies vertex value sendMessage(neighbor, value); // Sends message to a neighbor sendMessageToAllEdges(value); // Sends message to all neighbors }}

Vertex ID Type

Vertex Value Type

Edge Value Type

Message Value Type

Giraph Vertex API

Look at the shortest path source code:SimpleShortestPathsVertex.java (v1.0.0)

Giraph Input/Output

You can read vertex oriented (adjacency list) or edge oriented (pairs of vertices) files

Many formats already available:VertexInputFormat / VertexOutputFormat

HiveVertexInputFormat / HiveVertexOutputFormat

You can easily read any format extending VertexInputFormat / EdgeInputFormat

Thanks!

processing large graphs in hadoop

Software