TurboGraph: ultrafast graph analytics engine for billion-scale graphs in a single machine

TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC

Wook-Shin Han

Pohang University of Science and Technology (POSTECH)

Brief Introduction

• 20 years of experience in database engines

– Object-relational DBMS supporting multiple bindings

– Tight integration of DBMS with IR

– Progressive optimization in Parallel DB2

– Parallelizing optimizer in PostgreSQL

– iGraph

– TurboISO

– TurboGraph


Welcome to Graph World


Friendship network

Protein interactions

Internet map

Semantic web

Call Graph

Big Graphs in Real World


Successful commercialization cases

• (CMU startup)

• $6.75M in funding from Madrona Venture Group and NEA (5/14/2013)

•

• $2.55M in seed funding (11/14/2013)

• Funded by Farzad Nazem (founding CTO of Yahoo) and Shelley Zhuang (Founder of Eleven Two Capital)

Vertex-Centric Programming in Pregel

PageRankVertex::compute(messages)
{
    sum = 0
    for each msg in messages:
        sum = sum + msg->Value()
    SetValue(0.15/NumVertices() + 0.85*sum);
    …
    // n: number of out-edges of this vertex
    SendMessageToAllNeighbors(GetValue()/n);
    …
}
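The vertex-centric loop above can be simulated in a single process. The following is a minimal Python sketch, not Pregel's API: the function name `pagerank_supersteps`, the dict-based graph, and the fixed number of supersteps are illustrative assumptions. Only the update rule (0.15/NumVertices + 0.85*sum) and the "send value/n to all neighbors" step come from the slide.

```python
def pagerank_supersteps(out_edges, supersteps=30):
    """Simulate Pregel-style PageRank on a dict {vertex: [out-neighbors]}.

    Assumes every vertex has at least one out-edge (no dangling vertices).
    """
    n_vertices = len(out_edges)
    value = {v: 1.0 / n_vertices for v in out_edges}
    # Superstep 0: each vertex sends value / out-degree to all neighbors.
    inbox = {v: [] for v in out_edges}
    for v, nbrs in out_edges.items():
        for u in nbrs:
            inbox[u].append(value[v] / len(nbrs))
    for _ in range(supersteps):
        next_inbox = {v: [] for v in out_edges}
        for v, nbrs in out_edges.items():
            # compute(messages): sum the incoming messages, then update.
            total = sum(inbox[v])
            value[v] = 0.15 / n_vertices + 0.85 * total
            # SendMessageToAllNeighbors(GetValue() / n)
            for u in nbrs:
                next_inbox[u].append(value[v] / len(nbrs))
        inbox = next_inbox
    return value


# Hypothetical 3-vertex graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
ranks = pagerank_supersteps({1: [2, 3], 2: [3], 3: [1]})
```

In a real Pregel-like system the two inner loops run on different machines and the inboxes are network messages, which is where the communication cost comes from.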

Pregel-like systems

[Figure over five slides; labels: original graph (vertices 1 to 6), machine 1, machine 2]

Motivation


Gbase [KDD’11,VLDBJ’12]

Pregel [SIGMOD’10]

[VLDB’12]

GraphChi [OSDI’12]

Distributed system approach

Single machine approach

DBMS approach VERY SLOW for mining???

Can we exploit nice concepts in DBMSs without losing performance?

Comparison with other engines


[1] I. Stanton and G. Kliot, "Streaming Graph Partitioning for Large Distributed Graphs," KDD 2012.

[2] U. Kang, H. Tong, J. Sun, C. Lin, and C. Faloutsos, "GBASE: An Efficient Analysis Platform for Large Graphs," VLDB Journal, 2012.

[3] A. Kyrola, G. Blelloch, C. Guestrin, "GraphChi: Large-Scale Graph Computation on Just a PC," OSDI 2012.

Why is TurboGraph Ultra-fast?

• Full parallelism

• Multi-core parallelism

• Flash SSD IO parallelism

• Reading 400~500 Mbytes/sec from commodity SSDs

• 97K IOPS (High-performance Random Read)

• Full overlap

• CPU processing and I/O processing

• I/O latency can be hidden!


Three things to remember

• Efficient disk/in-memory graph storage

• Pin-and-slide model

• Handling general vectors (see the paper)

Challenges for Graph Storage

• Adjacency list vs. adjacency matrix

• Two types of graph operations in disk-based graphs

• Graph traversal (unique in graphs)

• Bitmap operations during computation


Disk-based representation in TurboGraph

• Slotted page of 1 Mbyte size

• Page contains records corresponding to adjacency lists

• RID consists of a page ID and a slot number

• Vertex IDs or RIDs in adjacency list

• Vertex ID approach

• Good for bitmap operation

• Bad for graph traversal

– requires a potentially LARGE mapping table!

• RID approach

• Good for graph traversal

• Seems to be bad for bitmap operation??


RID (mapping) table

• Each entry corresponds to a page (not a single RID)

• Size is very small

• Each entry stores the starting vertex ID in the page

• Translation of RID (pageID, slotNo) to vertex ID

• RIDTable[pageID].startVertex + slotNo

• Can be done in O(1)
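The translation above is a single array lookup plus an addition. A minimal Python sketch, assuming vertex IDs are consecutive within a page as the formula implies; the table contents are made-up values, and the reverse lookup `vertex_to_rid` (binary search over page start IDs) is an illustrative addition that is not on the slide. Large adjacency lists spanning several pages are ignored here.

```python
from bisect import bisect_right

# Hypothetical RID table: entry i holds the first vertex ID stored in page i.
rid_table = [0, 4, 9, 15]


def rid_to_vertex(page_id, slot_no):
    """Translate RID (pageID, slotNo) to a vertex ID in O(1)."""
    return rid_table[page_id] + slot_no


def vertex_to_rid(vertex_id):
    """Reverse lookup by binary search over the page start IDs."""
    page_id = bisect_right(rid_table, vertex_id) - 1
    return page_id, vertex_id - rid_table[page_id]
```

Because the table has one entry per page rather than one per vertex, it stays small even for billion-vertex graphs.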


In-memory Data structures

• Buffer pool

• Mapping table from page ID to frame ID

• Hash-table based mapping incurs significant performance overhead for graph traversal!

• TurboGraph uses a page table approach!

• A data structure for handling large adjacency lists (see paper)
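One plausible reading of the "page table approach" is a flat array indexed by page ID, so that a lookup is a direct index with no hashing. The sketch below illustrates that reading only; the class name `PageTable`, its methods, and the `-1` sentinel are assumptions, not TurboGraph's actual data structure.

```python
class PageTable:
    """Array-based mapping from page ID to buffer frame ID."""

    NOT_RESIDENT = -1  # sentinel: the page is not in the buffer pool

    def __init__(self, num_pages):
        self.frame_of = [self.NOT_RESIDENT] * num_pages

    def lookup(self, page_id):
        # Direct index: no hash computation, no collision handling.
        return self.frame_of[page_id]

    def install(self, page_id, frame_id):
        self.frame_of[page_id] = frame_id

    def evict(self, page_id):
        self.frame_of[page_id] = self.NOT_RESIDENT


table = PageTable(num_pages=8)
table.install(page_id=5, frame_id=2)
```

With 1 Mbyte pages the array has one small entry per page, so it is cheap to keep entirely in memory.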


Example


Core operations in buffer pool

• PINPAGE(pid)/UNPINPAGE(pid)

• support large adjacency lists

• PINCOMPUTEUNPIN(pid, RIDList, uo)

• Prepins an available frame

• Issues an asynchronous I/O request to the FlashSSD

• On completion of the I/O, a callback thread processes the vertices in the RIDList by invoking the user-defined function uo.Compute

• After processing all vertices in RIDList, unpin the page
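The four steps of PINCOMPUTEUNPIN can be sketched with a thread pool standing in for asynchronous SSD reads. Everything here is an illustrative assumption: the `BufferPool` class, the in-memory `DISK` dict, and the `compute(pid, slot, adjacency)` signature are hypothetical and only mirror the order of operations on the slide.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Hypothetical "disk": page ID -> {slot number: adjacency list}.
DISK = {0: {0: [1, 2], 1: [0]}, 1: {0: [0, 1]}}


class BufferPool:
    def __init__(self, io_workers=2):
        self.io = ThreadPoolExecutor(max_workers=io_workers)
        self.pinned = {}                      # page ID -> page contents
        self.lock = threading.Lock()

    def pin_compute_unpin(self, pid, slot_list, compute):
        with self.lock:
            self.pinned[pid] = None           # pre-pin a frame for the page

        def read_page():                      # stands in for the SSD read
            return DISK[pid]

        def on_complete(future):              # runs when the read finishes
            page = future.result()
            with self.lock:
                self.pinned[pid] = page
            for slot in slot_list:
                compute(pid, slot, page[slot])
            with self.lock:
                del self.pinned[pid]          # unpin after all vertices

        future = self.io.submit(read_page)    # asynchronous I/O request
        future.add_done_callback(on_complete)
        return future


degrees = {}


def record_degree(pid, slot, adjacency):
    degrees[(pid, slot)] = len(adjacency)


pool = BufferPool()
pool.pin_compute_unpin(0, [0, 1], record_degree)
pool.io.shutdown(wait=True)                   # wait for the read and callback
```

The caller returns immediately after issuing the request, which is what lets CPU work on other pages overlap with the I/O.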


Supported Query Power: Matrix-vector multiplication

• G = (V,E), X (column vector)

• M(G)i: i-th column vector of G

• Column view:

• Applications can define their own multiplication and summation semantics (the user-defined function Compute can generalize both)

• M(G)i is represented as the adjacency list of vi

• We can restrict the computation to just a subset of vertices
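In the column view, the product is built by scanning one column at a time, which here means one adjacency list at a time. The sketch below is a hedged illustration: the function `matvec_column_view` and its parameters are hypothetical names, and the graph is treated as unweighted (every stored entry is 1). It shows both user-defined semantics and restriction to a subset of vertices.

```python
def matvec_column_view(adj, x, multiply, combine, zero, subset=None):
    """Combine, over vertices i, multiply(x[i], entry) for column i.

    adj[i] is the adjacency list of vertex i, playing the role of M(G)i.
    Only vertices in `subset` (default: all) contribute.
    """
    y = [zero] * len(adj)
    vertices = range(len(adj)) if subset is None else subset
    for i in vertices:
        for j in adj[i]:                      # non-zero entries of column i
            y[j] = combine(y[j], multiply(x[i], 1))
    return y


adj = [[1, 2], [2], [0]]

# Ordinary arithmetic semantics.
y_sum = matvec_column_view(adj, [1.0, 2.0, 3.0],
                           lambda xi, m: xi * m, lambda a, b: a + b, 0.0)

# Boolean semantics: which vertices are one step from vertex 0?
reach = matvec_column_view(adj, [True, False, False],
                           lambda xi, m: xi and bool(m),
                           lambda a, b: a or b, False)

# Restrict the computation to vertex 0 only.
y_sub = matvec_column_view(adj, [1.0, 2.0, 3.0],
                           lambda xi, m: xi * m, lambda a, b: a + b, 0.0,
                           subset=[0])
```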


Column-view of matrix-vector multiplication in TurboGraph

Pin-and-Slide Model

• New computing model for efficiently processing the generalized matrix-vector multiplication in the column view

• Utilizes an execution thread pool and a callback thread pool

Pin-and-Slide Model (cont’d)

• Given a set V of vertices of interest,

• Identify the corresponding pages for V

• Pin the pages in the buffer pool

• Issue parallel asynchronous I/O requests for pages which are not in the buffer

• Without waiting for the I/O completion, execution threads concurrently process vertices in V that are in the pages pinned

• Slide the processing window one page at a time as soon as either an execution thread or a callback thread finishes the processing of a page
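The steps above can be sketched as a bounded window of pinned pages served by two thread pools. This is a simplified illustration under stated assumptions, not TurboGraph's implementation: `pin_and_slide`, `read_page`, and `process_page` are hypothetical names, a semaphore models the number of free frames, and blocking reads on a second pool stand in for asynchronous I/O with callback threads.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def pin_and_slide(pages_needed, resident, read_page, process_page, window=2):
    """Process every page in `pages_needed`, keeping at most `window` pinned.

    resident: dict page ID -> contents already in the buffer pool.
    read_page(pid): blocking read, executed on the I/O (callback) pool.
    """
    slots = threading.Semaphore(window)       # free frames in the window
    exec_pool = ThreadPoolExecutor(max_workers=2)
    io_pool = ThreadPoolExecutor(max_workers=2)

    def run(pid, load):
        try:
            process_page(pid, load())
        finally:
            slots.release()                   # unpin: the window slides

    for pid in pages_needed:
        slots.acquire()                       # pin a frame (blocks when full)
        if pid in resident:
            # Already in the buffer: an execution thread handles it.
            exec_pool.submit(run, pid, lambda p=pid: resident[p])
        else:
            # Not in the buffer: issue the read and process on completion.
            io_pool.submit(run, pid, lambda p=pid: read_page(p))
    exec_pool.shutdown(wait=True)
    io_pool.shutdown(wait=True)


processed = []
buffer_pool = {1: "p1", 2: "p2"}              # pages already in memory
pin_and_slide([1, 2, 0, 3, 4], buffer_pool,
              read_page=lambda pid: f"p{pid}",
              process_page=lambda pid, page: processed.append((pid, page)),
              window=3)
```

The main loop never waits for a specific read; it only waits for any frame to become free, so slow reads do not stall pages that are ready.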


Example


I = (0,1,1,1,0,1,1) v1 v5

1. identify pages (p0, p1, p2, p3, p4) for I

2. Pin p1 and p2

3. Issue asynchronous I/O request for p0

4. Execution threads process v2, v3, and v5 concurrently

5. On completion of I/O request for p0, callback threads process v1

6. After processing any page, unpin the page and slide the execution window, i.e., process p3 and finally process p4

Handling general vectors

• Indicator vector can be implemented as a bitmap

• However, what if we want to use general vectors instead?

• Consider PageRank where we need random accesses to PageRank values and out-degrees in two general vectors

Main idea of handling general vectors

• Adopt the concept of block-based nested loop join

• A general vector is partitioned into multiple chunks such that each chunk fits in memory

• Regard the pages pinned in the current buffer as a block

• Join a block with a chunk of each random vector in memory until all chunks are consumed

• Hide this mechanism as much as possible from users! (see paper)
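The block-based nested loop idea can be sketched for a PageRank-style accumulation. This is a minimal sketch under assumptions: the function name, the list-of-dicts representation of blocks, and the choice to chunk only the output vector are illustrative, and the Python list `y` merely stands in for a vector too large to keep in memory.

```python
def block_nested_loop_matvec(blocks, x, out_degree, num_vertices, chunk_size):
    """Accumulate y[j] += x[i] / out_degree[i] over all edges i -> j.

    blocks: list of blocks, each a dict {i: adjacency list of i}.
    Only one chunk of y is treated as "in memory" at a time.
    """
    y = [0.0] * num_vertices                  # stands in for the on-disk vector
    for block in blocks:                      # outer loop: pinned pages
        for start in range(0, num_vertices, chunk_size):   # inner loop: chunks
            end = min(start + chunk_size, num_vertices)
            chunk = y[start:end]              # "load" the chunk
            for i, neighbors in block.items():
                share = x[i] / out_degree[i]
                for j in neighbors:
                    if start <= j < end:      # join condition: j is in chunk
                        chunk[j - start] += share
            y[start:end] = chunk              # "write back" the chunk
    return y


blocks = [{0: [1, 2], 1: [2]}, {2: [0]}]
y = block_nested_loop_matvec(blocks, x=[1.0, 1.0, 1.0],
                             out_degree=[2, 1, 1],
                             num_vertices=3, chunk_size=2)
```

Each block is scanned once per chunk, so all random accesses fall inside the chunk currently in memory.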


Example


Processing Graph Queries

• We support graph queries based on matrix-vector multiplication

• Targeted queries processing only part of a graph

• Global queries processing the whole graph

• Targeted queries

• BFS, K-step neighbors, Induced subgraph, K-step egonet, K-core, cross-edges etc.

• Global queries

• PageRank, connected component
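As an example of a targeted query, BFS can be written as repeated boolean matrix-vector products in which the frontier is an indicator vector. The sketch below is illustrative only (the function name `bfs_levels` and the list-based graph are assumptions); it shows why a targeted query touches only the adjacency lists of the current vertices of interest.

```python
def bfs_levels(adj, source):
    """BFS as repeated products with an indicator (frontier) vector.

    adj[i] is the adjacency list of vertex i. Returns the BFS level of each
    vertex, or -1 for vertices that are unreachable from `source`.
    """
    n = len(adj)
    level = [-1] * n
    frontier = [False] * n
    frontier[source] = True
    level[source] = 0
    depth = 0
    while any(frontier):
        depth += 1
        next_frontier = [False] * n
        for i in range(n):
            if frontier[i]:                   # only vertices of interest
                for j in adj[i]:
                    if level[j] == -1:
                        level[j] = depth
                        next_frontier[j] = True
        frontier = next_frontier
    return level


levels = bfs_levels([[1, 2], [3], [3], [], [0]], source=0)
```

A global query such as PageRank corresponds to the same loop with every bit of the indicator vector set.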


Experimental setup

• Datasets

• LiveJournal (4.8M vertices), Twitter (42M vertices), YahooWeb (1.4B vertices)

• Intel i7 6-core PC with 12 GB RAM

• 512GB SSD (Samsung 840 series)

• Bypass OS cache to guarantee real I/Os

• Main competitors

• GraphChi

• GreenMarl [ASPLOS’12] (in-memory graph engine)


Breadth-First Search (cold-run)

• Varying the buffer size


• GraphChi does not show performance improvement for larger buffer size

• Pin-and-Slide of TurboGraph utilizes the buffer pool in a smart way!

• TurboGraph outperforms GreenMarl, which first loads the whole graph in memory and then executes BFS, by a small margin

Varying # of execution threads (hot run)

• TurboGraph achieves better speedups than GreenMarl, although GreenMarl slightly outperforms TurboGraph

• GraphChi shows poor performance as we increase the number of execution threads

Targeted Queries


TurboGraph outperforms GraphChi by up to four orders of magnitude.

Global Queries

TurboGraph outperforms GraphChi by up to 27.69 times for PageRank and 144.11 times for Connected Component+.

+ See the upcoming paper for details and much faster performance.

Triangles in Graph Analysis and Mining


Clustering Coefficient

Transitivity

Triangular connectivity

Community Detection

Spam Detection

SIGMOD 2014

Goal

• We propose OPT, an Overlapped, Parallel, disk-based Triangulation framework for a single machine


HOW

• We propose an overlapped, parallel, disk-based triangulation framework in a single machine.

– By recognizing common components in disk-based triangulation

– By using a two-level overlapping strategy

• Micro level: asynchronous, parallel I/O using FlashSSD

• Macro level: multi-core parallelism

Graph Representation

• A graph is represented in the adjacency list form.

• Adjacency lists are stored in slotted pages.

• I/O is performed at the page level.

[Figure: example graph with vertices a to h; the adjacency lists n(a) = {b,c}, n(b) = {a,c}, n(c), n(d), n(e), n(f), n(g), and n(h) are stored in pages p1 to p4]

When main memory is not enough …

[Figure: the example graph (vertices a to h) is stored on disk in pages p1 to p4; main memory holds n(a), n(b), n(c), and n(d), and there is no space to load more]

We want to identify triangles in which a, b, c, and d participate.

Observation

• △abc and △cdf can be identified using the adjacency lists already loaded in main memory. These are internal triangles.

[Figure: example graph; pages p1 and p2]

Observation

• △cfg, △cgh, and △def cannot be identified in main memory. These are external triangles.

• To identify them, the adjacency lists of e, f, g, and h must be loaded into main memory. These vertices are external candidate vertices.

[Figure: example graph; pages p3 and p4]

Related Work

• …

• MGT [SIGMOD2013]

– The first work to use block-based nested loop

– Shows nontrivial bounds for triangulation


Overall Procedure

1. Split main memory into two parts – internal area and external area
2. Load a part of adjacency lists in internal area
3. Identify external candidate vertices
4. Identify internal triangles
5. Load adjacency lists of external candidate vertices in external area
6. Find external triangles
7. Go to step 5 until all adjacency lists of external candidate vertices are loaded
8. Go to step 2 until all adjacency lists are loaded in internal area
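The eight steps can be traced on the small example graph. The sketch below is sequential (no overlapping) and rests on several assumptions: the edge set is reconstructed from the five triangles named on the earlier slides, each adjacency list keeps only the neighbors that come later in alphabetical order so that every triangle is reported once, and `opt_triangles` is a hypothetical function name. It uses an edge-iterator style intersection as one possible instantiation.

```python
def opt_triangles(n_succ, order, internal_size):
    """List triangles while holding `internal_size` adjacency lists at a time.

    n_succ[u]: set of neighbors of u that come after u in `order`.
    Returns (internal triangles, external triangles) as lists of tuples.
    """
    internal_tris, external_tris = [], []
    for start in range(0, len(order), internal_size):
        internal = order[start:start + internal_size]
        loaded = {u: n_succ[u] for u in internal}               # step 2
        candidates = sorted({v for u in internal for v in loaded[u]
                             if v not in loaded})                # step 3
        for u in internal:                                       # step 4
            for v in sorted(loaded[u]):
                if v in loaded:
                    for w in sorted(loaded[u] & loaded[v]):
                        internal_tris.append((u, v, w))
        for v in candidates:                                     # steps 5 to 7
            external_list = n_succ[v]         # "load" one external list
            for u in internal:
                if v in loaded[u]:
                    for w in sorted(loaded[u] & external_list):
                        external_tris.append((u, v, w))
    return internal_tris, external_tris                          # step 8: loop


# Assumed edges: ab, ac, bc, cd, cf, cg, ch, de, df, ef, fg, gh.
n_succ = {"a": {"b", "c"}, "b": {"c"}, "c": {"d", "f", "g", "h"},
          "d": {"e", "f"}, "e": {"f"}, "f": {"g"}, "g": {"h"}, "h": set()}
internal, external = opt_triangles(n_succ, list("abcdefgh"), internal_size=4)
```

With a, b, c, and d in the internal area, the internal pass reports △abc and △cdf, and the external pass reports △def, △cfg, and △cgh, matching the two Observation slides.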


Generic Framework

• Any vertex/edge iterator triangulation method is applicable to OPT

• It is enough to define

– Internal triangulation algorithm

– External candidate vertices identification algorithm

– External triangulation algorithm

Macro-Level Overlapping

• Using multi-core parallelism,

– OPT overlaps internal triangulation and external triangulation.


Macro-Level Overlapping (cont’d)

• After applying both levels of overlap, OPT
1. Split main memory into two parts – internal area and external area
2. Load a part of adjacency lists in internal area
3. Identify external candidate vertices
4. Identify internal triangles
5. Load adjacency lists of external candidate vertices in external area
6. Find external triangles
7. Go to step 5 until all adjacency lists of external candidate vertices are loaded
8. Go to step 2 until all adjacency lists are loaded in internal area

[Slide annotation: macro-level overlapping]

Micro-Level Overlapping

• In OPT, all reads are requested by AsyncRead.

• After applying micro-level overlapping, OPT
1. Split main memory into two parts – internal area and external area
2. Load a part of adjacency lists in internal area
3. Identify external candidate vertices
4. Identify internal triangles
5. Load adjacency lists of external candidate vertices in external area
6. Find external triangles
7. Go to step 5 until all adjacency lists of external candidate vertices are loaded
8. Go to step 2 until all adjacency lists are loaded in internal area

[Slide annotation: micro-level overlapping]

Experiment Setup

• Datasets

Dataset   |V|      |E|
LJ        4.8M     69.0M
ORKUT     3.1M     223.5M
TWITTER   41.7M    1.47B
UK        105.9M   3.74B
YAHOO     1.41B    6.64B

• Intel i7 6-core PC with 16GB RAM

• 512GB FlashSSD (Samsung 830)

• OPT is implemented using TurboGraph [KDD’13].

• Competitors

– GraphChi [OSDI’12]

– CC-SEQ/CC-DS [KDD’11]

– MGT [SIGMOD’13]

Effect of Micro-Level Overlapping


Less than 7% overhead

[Figure: x axis: memory size (% of graph size); y axis: elapsed time relative to the in-memory method]

Effect of Number of CPU Cores


Result on YAHOO Dataset

• Single core

– OPT showed 2.04 and 10.72 times shorter elapsed time than MGT and GraphChi, respectively.

• Six cores

– OPT showed 31.36 times shorter elapsed time than GraphChi.

Conclusions

• An ultra-fast, parallel graph engine called TurboGraph for efficiently handling billion-scale graphs in a single PC

• Efficient graph storage

• Pin-and-slide model, which implements the column view of the matrix-vector multiplication

• Extensive experiments show the outstanding performance of TurboGraph for real, billion-scale graphs