Evaluating Use of Data Flow Systems for Large Graph Analysis
TRANSCRIPT
Lawrence Livermore National Laboratory
Andy Yoo and Ian Kaplan
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
Graph mining techniques have been widely used in many important applications in recent years
• Graph mining extracts information by analyzing relations and structures in graphs (such as ER graphs)
• So-called "scale-free" graphs can carry rich information
Graph Mining Applications: Web Search
• Google's PageRank uses a web graph to rank web pages for given queries
• Related applications
– Personalized web search
– People search
– Eigenvalue/eigenvector computation
– Random walk with restart
Graph Mining Applications: Social Network Analysis
• Zachary's Karate Club (1977) divided into two groups centered around two individuals, 1 and 34
• Community detection algorithms can identify the two communities (e.g., Girvan and Newman, 2002)
• Further analysis reveals detailed community structures in the graph (e.g., van Dongen, 2000 and Palla, 2005)
Graph Mining Applications: Protein Clustering
• Can discover proteins with similar functions by clustering protein modules in protein-protein interaction graphs
• Protein-protein interaction network of yeast (Adamcsek et al., Bioinformatics, 1021, 2006)
Graph Mining Applications: National Security
• Apply subgraph pattern matching algorithms to intelligence analysis (e.g., J. Ullman, 1976)
• Other related applications
– Exact and inexact pattern discovery
– Fraud detection
– Cyber security
– Behavioral prediction
• T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, ACM, 2004
Challenges
• High complexity of graph mining algorithms
– Common graph mining algorithms have high-order computational complexity
• High-order algorithms (O(N^2) and above): PageRank, community finding, path traversal
• NP-complete algorithms: maximal cliques, subgraph pattern matching
• Large data size requires out-of-core approaches
– Graphs with 10^9+ nodes and edges are increasingly common
– Intermediate results grow exponentially in many cases
Traditional relational databases have been used in large graph analysis
• Due to prevalence and ease of use, conventional database systems have been used in graph analysis
• Designed for transaction processing
• Poor performance and scalability

Distribution of response time for 100 bi-directional searches:
< 1 minute: 26%
1 - 2 minutes: 37%
2 - 5 minutes: 30%
5 - 10 minutes: 2%
10+ minutes: 5%

300B node graph search on Netezza on 700-node NPS (SC'06)
120B node graph search on 60-node MSSG (Cluster '06)
Many-tasks paradigm is currently used for analyzing large data sets: Map/Reduce
• Map/Reduce is a popular many-tasks model being used for a wide range of applications
• Map/Reduce model
– A M/R program consists of many map and reduce tasks
– Each task works independently
– Data passes between mappers and reducers via intermediate files
– Processes lists of (key, value) pairs
• Is Map/Reduce for everything?
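The Map/Reduce model described above can be illustrated with a toy single-process sketch (the word-count example and all names here are illustrative, not part of any real framework):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process Map/Reduce: run mappers over the input,
    group the intermediate (key, value) pairs by key, then run a
    reducer per key -- mirroring the independent-task structure."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            intermediate[key].append(value)
    return {key: reducer(key, values)       # reduce phase
            for key, values in intermediate.items()}

# Word count, the canonical Map/Reduce example
lines = ["graph mining on large graphs", "large scale graph analysis"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda key, values: sum(values))
```

In a real system the intermediate pairs go through files and a shuffle phase rather than an in-memory dictionary; that shuffle is exactly the intermediate-result handling criticized on the next slide.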
Map/Reduce model is too limited for large complex graph analysis
• Map/Reduce successfully used for some applications
– Inverted index construction
– Distributed sort
– Term-vector calculation
– PageRank
• Drawbacks
– Model limited to embarrassingly parallel applications
– Poor performance and scalability (due to poor handling of intermediate results)

BFS Search Results
System      Platform               Time (Sec)
Map/Reduce  20-node Fenix cluster  1068
SGRACE      64-node Tuson cluster  221

Full PubMed graph with 30 million vertices and 500 million edges was used, except for SGRACE, for which a synthetic graph with 25 million vertices and 125 million edges was used. Normalized to 64 nodes, the Map/Reduce time is 333.75 sec (1068 x 20/64).
Dataflow model is a promising alternative to address these issues
• Many independent tasks accessing external data in parallel, realizing data parallelism
– Tasks triggered by the availability of data
– No flow of control
– Data parallel and independent
• More flexible and complex than Map/Reduce (Map/Reduce on steroids!)
• We evaluated the use of the dataflow model for large graph analysis in this work
(Figure: Dryad dataflow diagram)
We measured the performance of graph algorithms on an actual dataflow machine: Data Analytic Supercomputer

DAS:
• Parallel dataflow engine on commodity clusters
• Specialized high-performance library
– Streaming data pipelined for maximum in-memory processing
– Sequentialized disk accesses
– Optimized for SORT and JOIN operations
• Offers great flexibility for optimization

vs. RDBMS:
• Sequential or parallel relational database systems on commodity HW
• Optimized for transaction processing
• Ubiquitous
• Relatively easy to use
• Relies on SQL compiler for optimization
DAS programming and execution environment
• Uses ECL, a proprietary dataflow language
• Built-in ECL data manipulation constructs are implemented in a highly optimized library
– JOIN, SORT, MERGE, etc.
• Unlike SQL, these low-level constructs are suitable for complex graph operations
(Figure: ECL code is compiled, with the ECL library, into C++ code and then into an executable that runs on the compute elements (CEs))
An example ECL code:

    vertex_rec := RECORD
      INTEGER8 gid;
    END;

    adjacent_raw := DATASET('pubmed::datasets::full::bi_links_split_ds',
                            PubMed_Definitions_Full.Links_Bidirectional, THOR);
    adjacent_distr := DISTRIBUTE(adjacent_raw, HASH32(src_gid));
    adjacent_sort := SORT(adjacent_distr, src_gid, LOCAL);
    adjacent := DEDUP(adjacent_sort, src_gid, LOCAL);
    OUTPUT(adjacent);
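For readers unfamiliar with ECL, the DISTRIBUTE / SORT / DEDUP pipeline above behaves roughly like the following Python sketch; the edge tuples, partition count, and use of Python's built-in hash in place of HASH32 are illustrative assumptions:

```python
def distribute(edges, num_nodes):
    """DISTRIBUTE-style step: route each edge to a partition keyed on
    a hash of its source vertex, so a vertex's edges co-locate."""
    partitions = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        partitions[hash(src) % num_nodes].append((src, dst))
    return partitions

def local_sort_dedup(partition):
    """LOCAL SORT then LOCAL DEDUP on the source id: keep the first
    edge per source, mirroring DEDUP(adjacent_sort, src_gid, LOCAL)."""
    seen, result = set(), []
    for src, dst in sorted(partition):
        if src not in seen:
            seen.add(src)
            result.append((src, dst))
    return result

edges = [(1, 2), (1, 3), (2, 3), (3, 1), (2, 4)]
deduped = [e for p in distribute(edges, 4) for e in local_sort_dedup(p)]
```

The LOCAL qualifier matters: because the data was hash-distributed on src_gid first, each node can sort and dedup its own partition with no further communication.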
We evaluated some of the most commonly used applications in our experiments

Applications evaluated on the DAS system:
Path Traversal: Uni- and bi-directional BFS
Pattern Matching: Find subgraphs that match a given template
TeraByte (TB) Sort: Jim Gray's SORT benchmark
Page Rank: Eigenvector computation using the power method
Disambiguation: Binning-based coreference resolution
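The Page Rank entry above computes the dominant eigenvector of the web matrix by power iteration. A minimal sketch, assuming a toy adjacency dict and the conventional 0.85 damping factor (both illustrative):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-method PageRank on an adjacency dict {node: [neighbors]}.
    Each iteration multiplies the rank vector by the damped transition
    matrix; dangling nodes spread their rank uniformly."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, neighbors in adj.items():
            if neighbors:
                share = damping * rank[v] / len(neighbors)
                for u in neighbors:
                    new[u] += share
            else:  # dangling node
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Each iteration is a sparse matrix-vector product, which is why the algorithm maps naturally onto join-and-aggregate dataflow operations.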
Real-world graphs are used in our performance experiments

PubMed graph schema (entities and relations): Author, Article, Journal, Journal Issue, Grant, Grant Agency, Chemical, Keyword, MeshHeading, ContactInfo, connected by IsAuthorOf, PublishedIn, IsIssueOf, FundedByGrant, IssuedGrant, HasChemical, HasKeyword, HasMeshHeading, and HasContactInfo relations.

               PubMed Sm   PubMed Lg
|V|            1M          29M
|E|            2M          270M
Raw data size  400 MB      127 GB
Path Traversal: Breadth-first search (BFS) on DAS
• Search from a source vertex to a destination vertex
• Improved performance by constructing the adjacency list via denormalization, which reduces the number of rows to join
• Used large PubMed data

BFS time (seconds):
                Edge List   Adjacency List (Denormalized)
Unidirectional  287.926     120.359
Bidirectional   204.90      56.431
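Level-synchronous BFS over a denormalized adjacency list can be sketched as follows; the toy edge list is illustrative, and in the dataflow setting the inner loop becomes a frontier-to-adjacency join:

```python
from collections import defaultdict

def bfs_levels(edges, source):
    """Level-synchronous BFS: each step 'joins' the current frontier
    against a denormalized adjacency list (one entry per vertex,
    rather than one row per edge), returning each vertex's depth."""
    adjacency = defaultdict(list)          # denormalized form
    for src, dst in edges:
        adjacency[src].append(dst)
    level, visited, frontier = {source: 0}, {source}, [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:                 # frontier x adjacency join
            for u in adjacency[v]:
                if u not in visited:
                    visited.add(u)
                    level[u] = depth
                    next_frontier.append(u)
        frontier = next_frontier
    return level

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
levels = bfs_levels(edges, 1)
```

Denormalization helps precisely because the join touches one adjacency row per frontier vertex instead of scanning many edge rows, matching the speedups in the table above.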
DAS system is ideal for handling complex subgraph pattern queries on large data sets
• Find authors who published four articles on specific dates (Query 1)
• Find authors who published two articles in the same journal (Query 2)
• Find authors who published four articles in the journal Physical Review Letters (Query 3)
DAS system is ideal for handling complex subgraph pattern queries on large data sets (cont'd)
• Find two authors who have co-authored two papers (Query 4)
• Find an article that has an associated grant, an article that does not, and their corresponding authors (Query 5)
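These pattern queries reduce to joins and group-bys over the relation tables. As one hedged example, Query 2 (authors with two articles in the same journal) can be sketched like this; the relation names and sample data are illustrative, not the actual PubMed tables:

```python
from collections import defaultdict

def authors_with_two_in_same_journal(is_author_of, published_in):
    """Query 2 sketch: join IsAuthorOf with PublishedIn, group by
    (author, journal), and keep groups with >= 2 distinct articles.
    is_author_of: (author, article) pairs; published_in: article -> journal."""
    per_author_journal = defaultdict(set)
    for author, article in is_author_of:
        journal = published_in.get(article)
        if journal is not None:
            per_author_journal[(author, journal)].add(article)
    return {(author, journal)
            for (author, journal), articles in per_author_journal.items()
            if len(articles) >= 2}

is_author_of = [("smith", "a1"), ("smith", "a2"), ("smith", "a3"),
                ("jones", "a2"), ("jones", "a4")]
published_in = {"a1": "PRL", "a2": "PRL", "a3": "Nature", "a4": "Nature"}
matches = authors_with_two_in_same_journal(is_author_of, published_in)
```

Queries 1 and 3 follow the same shape with a count of four and an extra filter on date or journal name; the four-way self-joins are where the DAS speedups in the next table come from.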
Query Performance for Large PubMed (30M nodes), in seconds:

         DAS (20 nodes)   Netezza (54 nodes)   YADM (4 nodes)
Query 1  9.422            27.3                 120.00
Query 2  142.099          834.47               930.00
Query 3  469.511          15392.96             10188.00
Query 4  37.803           741.42               667.00
Query 5  44.600           496.48               N/A

~250 to ~300X speedup
DAS system still outperforms other SQL machines in price/performance

Metric = Time / (#Spindles * Cost):

         DAS           Netezza       YADM
Query 1  2.51253E-05   6.66144E-05   0.0004
Query 2  0.000378931   0.00203618    0.0031
Query 3  0.001252029   0.037560164   0.03396
Query 4  0.000100808   0.001809129   0.002223333
Query 5  0.000118933   0.001211454   N/A
LNSSI/LLNL measured Terabyte Sort (TB Sort) performance
• One of the sort benchmarks that measures the elapsed time to sort 10^12 bytes of data
• Yahoo holds the current record (as of March 2009)
– 3.48 minutes on 910 nodes (4 dual-core processors, 4 disks, 8 GB memory per node), using Hadoop Map/Reduce
• Performed TB sort on a 20-node DAS system
• Achieved 2X speedup by
– Radix-based distribution and local sort, which makes tasks independent
– Optimized SORT operation

Apache Hadoop  03:20:44
DAS            01:39:26
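The radix-based distribution plus local sort idea can be sketched as follows; the key space and partition count are illustrative assumptions, not the benchmark's actual parameters:

```python
def radix_distributed_sort(keys, key_space=1000, num_nodes=4):
    """TB Sort-style scheme: radix-partition keys into globally
    ordered, disjoint ranges, sort each partition independently
    ('locally'), and concatenate. Because the ranges are ordered,
    no global merge or cross-node communication is needed after
    the initial distribution -- the tasks stay independent."""
    buckets = [[] for _ in range(num_nodes)]
    for k in keys:
        buckets[k * num_nodes // key_space].append(k)
    result = []
    for bucket in buckets:          # each local sort is independent
        result.extend(sorted(bucket))
    return result

data = [802, 15, 407, 933, 250, 3, 611]
ordered = radix_distributed_sort(data)
```

Contrast this with a hash distribution, which balances load but destroys global order and therefore forces a final merge phase.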
Found some key people from large Enron email graph by running the PageRank algorithm
• Data has 4022 Enron employees and 51078 emails
• Top 30 high scorers found, with some notable names
– Jeff Dasovich
– Louise Kitchen
– Tana Jones
– John Lavorato
• Took 7 seconds to run on DAS
Developed scalable algorithm for author disambiguation in many-tasks paradigm
• LLNL has developed an entity resolution algorithm based on a binning (or blocking) algorithm
• Original algorithm could not complete on the full PubMed data set
– Only 67% completed (in 2 months)
– Could not resolve bins with 300+ names
• Uses DAS as an active-disk system
– "Bring computation to where the data is, instead of moving data from the data store"
• Achieved orders of magnitude performance improvement
Distributed disambiguation algorithm: DAS as an Active Disk
• Original (sequential) algorithm: a binning algorithm (BA) pulls author info (bins, coauthors, keywords, abstracts, titles) from a MySQL RDBMS over JDBC; the RDBMS is the performance bottleneck
• Many-tasks disambiguation algorithm: reader, binner, and resolver stages run on DAS; the binning algorithm works only on local data, executed by many tasks in parallel, producing the disambiguated authors
• Able to process 45,529,883 bins in 20 hours!
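The binning (blocking) structure described above can be sketched as follows. This is a toy illustration only: the bin key (last name plus first initial) and the coauthor-overlap resolver are assumptions standing in for LLNL's actual algorithm, which also uses keywords, abstracts, and titles:

```python
from collections import defaultdict

def bin_authors(author_records):
    """Blocking step: group author mentions into bins by a cheap key
    (here, last name + first initial) so the expensive pairwise
    resolution runs only inside each bin, and bins run in parallel."""
    bins = defaultdict(list)
    for rec in author_records:
        last, first = rec["name"].split(",", 1)
        key = (last.strip().lower(), first.strip()[:1].lower())
        bins[key].append(rec)
    return bins

def resolve_bin(records):
    """Toy resolver: mentions sharing a coauthor are the same person."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if any(set(rec["coauthors"]) & set(r["coauthors"])
                   for r in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

mentions = [
    {"name": "Yoo, A.", "coauthors": ["Kaplan"]},
    {"name": "Yoo, Andy", "coauthors": ["Kaplan", "Smith"]},
    {"name": "Yoo, B.", "coauthors": ["Lee"]},
]
bins = bin_authors(mentions)
```

Because resolution touches only records within a bin, each bin is an independent task, which is what lets the DAS run the binner and resolver stages against local data with no central database in the loop.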
Many-tasks model has enabled efficient large graph analysis
• High performance and scalability feasible with the many-tasks approach
• Benefits
– Enables data parallelism on large-scale data
– Reduces communication via independent, localized tasks
– Enables optimization of tasks for built-in constructs
– Combines complexity and flexibility
Conclusions
• Studied the use of the many-tasks model for large complex graph analysis
• Evaluated the performance of a comprehensive set of graph applications, including subgraph pattern queries, on an actual dataflow system
• The many-tasks paradigm is a very promising approach for graph mining applications and offers many advantages over contemporary methods like RDBMS and Map/Reduce