Evaluating Use of Data Flow Systems for Large Graph Analysis
TRANSCRIPT
Lawrence Livermore National Laboratory
Andy Yoo and Ian Kaplan
Lawrence Livermore National Laboratory, P.O. Box 808, Livermore, CA 94551
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344
Graph mining techniques have been widely used in many important applications in recent years
• Graph mining extracts information by analyzing relations and structures in graphs (such as ER graphs)
• So-called "scale-free" graphs can carry rich information
Graph Mining Applications: Web Search
• Google's PageRank uses a web graph to rank web pages for given queries
• Related applications
– Personalized web search
– People search
– Eigenvalue/eigenvector computation
– Random walk with restart
Graph Mining Applications: Social Network Analysis
• Zachary's Karate Club (1977) divided into two groups centered around two individuals, 1 and 34
• Community detection algorithms can identify the two communities (e.g., Girvan and Newman, 2002)
• Further analysis reveals detailed community structures in the graph (e.g., van Dongen, 2000 and Palla, 2005)
Graph Mining Applications: Protein Clustering
• Can discover proteins with similar functions by clustering protein modules in protein-protein interaction graphs
• Protein-protein interaction network of yeast (Adamcsek et al., Bioinformatics, 1021, 2006)
Graph Mining Applications: National Security
• Apply subgraph pattern matching algorithms to intelligence analysis (e.g., J. Ullman, 1976)
• Other related applications
– Exact and inexact pattern discovery
– Fraud detection
– Cyber security
– Behavioral prediction
• T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, ACM, 2004
Challenges
• High complexity of graph mining algorithms
– Common graph mining algorithms have high-order computational complexity
• High-order algorithms (O(N^2) and above): PageRank, community finding, path traversal
• NP-complete algorithms: maximal cliques, subgraph pattern matching
• Large data size requires out-of-core approaches
– Graphs with 10^9+ nodes and edges are increasingly common
– Intermediate results grow exponentially in many cases
Traditional relational databases have been used in large graph analysis
• Due to prevalence and ease of use, conventional database systems have been used in graph analysis
• Designed for transaction processing
• Poor performance and scalability

Distribution of response time for 100 bi-directional searches:
< 1 minute: 26%
1 - 2 minutes: 37%
2 - 5 minutes: 30%
5 - 10 minutes: 2%
10+ minutes: 5%

300B node graph search on Netezza on 700-node NPS (SC'06)
120B node graph search on 60-node MSSG (Cluster '06)
Many-tasks paradigm is currently used for analyzing large data sets: Map/Reduce
• Map/Reduce is a popular many-tasks model being used for a wide range of applications
• Map/Reduce model
– A M/R program consists of many map and reduce tasks
– Each task works independently
– Data passes between mappers and reducers via intermediate files
– Processes lists of (key, value) pairs
• Is Map/Reduce for everything?
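The Map/Reduce model described above can be illustrated with a toy single-process sketch (the word-count example and all names here are illustrative, not part of any real framework):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-process Map/Reduce: run mappers over the input,
    group the intermediate (key, value) pairs by key, then run a
    reducer per key -- mirroring the independent-task structure."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            intermediate[key].append(value)
    return {key: reducer(key, values)       # reduce phase
            for key, values in intermediate.items()}

# Word count, the canonical Map/Reduce example
lines = ["graph mining on large graphs", "large scale graph analysis"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda key, values: sum(values))
```

In a real system the intermediate pairs go through files and a shuffle phase rather than an in-memory dictionary; that shuffle is exactly the intermediate-result handling criticized on the next slide.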
Map/Reduce model is too limited for large complex graph analysis
• Map/Reduce successfully used for some applications
– Inverted index construction
– Distributed sort
– Term-vector calculation
– PageRank
• Drawbacks
– Model limited to embarrassingly parallel applications
– Poor performance and scalability (due to poor handling of intermediate results)

BFS Search Results
System      Platform               Time (Sec)
Map/Reduce  20-node Fenix cluster  1068
SGRACE      64-node Tuson cluster  221

Full PubMed graph with 30 million vertices and 500 million edges was used, except for SGRACE, for which a synthetic graph with 25 million vertices and 125 million edges was used. Normalized to 64 nodes, the Map/Reduce time is 333.75 sec (1068 x 20/64).
Dataflow model is a promising alternative to address these issues
• Many independent tasks accessing external data in parallel, realizing data parallelism
– Tasks triggered by the availability of data
– No flow of control
– Data parallel and independent
• More flexible and complex than Map/Reduce (Map/Reduce on steroids!)
• We evaluated the use of the dataflow model for large graph analysis in this work
(Figure: Dryad dataflow diagram)
We measured the performance of graph algorithms on an actual dataflow machine: Data Analytic Supercomputer

DAS:
• Parallel dataflow engine on commodity clusters
• Specialized high-performance library
– Streaming data pipelined for maximum in-memory processing
– Sequentialized disk accesses
– Optimized for SORT and JOIN operations
• Offers great flexibility for optimization

vs. RDBMS:
• Sequential or parallel relational database systems on commodity HW
• Optimized for transaction processing
• Ubiquitous
• Relatively easy to use
• Relies on SQL compiler for optimization
DAS programming and execution environment
• Uses ECL, a proprietary dataflow language
• Built-in ECL data manipulation constructs are implemented in a highly optimized library
– JOIN, SORT, MERGE, etc.
• Unlike SQL, these low-level constructs are suitable for complex graph operations
(Figure: ECL code is compiled, with the ECL library, into C++ code and then into an executable that runs on the compute elements (CEs))
An example ECL code:

    vertex_rec := RECORD
      INTEGER8 gid;
    END;

    adjacent_raw := DATASET('pubmed::datasets::full::bi_links_split_ds',
                            PubMed_Definitions_Full.Links_Bidirectional, THOR);
    adjacent_distr := DISTRIBUTE(adjacent_raw, HASH32(src_gid));
    adjacent_sort := SORT(adjacent_distr, src_gid, LOCAL);
    adjacent := DEDUP(adjacent_sort, src_gid, LOCAL);
    OUTPUT(adjacent);
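For readers unfamiliar with ECL, the DISTRIBUTE / SORT / DEDUP pipeline above behaves roughly like the following Python sketch; the edge tuples, partition count, and use of Python's built-in hash in place of HASH32 are illustrative assumptions:

```python
def distribute(edges, num_nodes):
    """DISTRIBUTE-style step: route each edge to a partition keyed on
    a hash of its source vertex, so a vertex's edges co-locate."""
    partitions = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        partitions[hash(src) % num_nodes].append((src, dst))
    return partitions

def local_sort_dedup(partition):
    """LOCAL SORT then LOCAL DEDUP on the source id: keep the first
    edge per source, mirroring DEDUP(adjacent_sort, src_gid, LOCAL)."""
    seen, result = set(), []
    for src, dst in sorted(partition):
        if src not in seen:
            seen.add(src)
            result.append((src, dst))
    return result

edges = [(1, 2), (1, 3), (2, 3), (3, 1), (2, 4)]
deduped = [e for p in distribute(edges, 4) for e in local_sort_dedup(p)]
```

The LOCAL qualifier matters: because the data was hash-distributed on src_gid first, each node can sort and dedup its own partition with no further communication.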
We evaluated some of the most commonly used applications in our experiments

Applications evaluated on the DAS system:
Path Traversal: Uni- and bi-directional BFS
Pattern Matching: Find subgraphs that match a given template
TeraByte (TB) Sort: Jim Gray's SORT benchmark
Page Rank: Eigenvector computation using the power method
Disambiguation: Binning-based coreference resolution
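The Page Rank entry above computes the dominant eigenvector of the web matrix by power iteration. A minimal sketch, assuming a toy adjacency dict and the conventional 0.85 damping factor (both illustrative):

```python
def pagerank(adj, damping=0.85, iters=50):
    """Power-method PageRank on an adjacency dict {node: [neighbors]}.
    Each iteration multiplies the rank vector by the damped transition
    matrix; dangling nodes spread their rank uniformly."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, neighbors in adj.items():
            if neighbors:
                share = damping * rank[v] / len(neighbors)
                for u in neighbors:
                    new[u] += share
            else:  # dangling node
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Each iteration is a sparse matrix-vector product, which is why the algorithm maps naturally onto join-and-aggregate dataflow operations.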
Real-world graphs are used in our performance experiments

PubMed graph schema (entities and relations): Author, Article, Journal, Journal Issue, Grant, Grant Agency, Chemical, Keyword, MeshHeading, ContactInfo, connected by IsAuthorOf, PublishedIn, IsIssueOf, FundedByGrant, IssuedGrant, HasChemical, HasKeyword, HasMeshHeading, and HasContactInfo relations.

               PubMed Sm   PubMed Lg
|V|            1M          29M
|E|            2M          270M
Raw data size  400 MB      127 GB
Path Traversal: Breadth-first search (BFS) on DAS
• Search from a source vertex to a destination vertex
• Improved performance by constructing the adjacency list via denormalization, which reduces the number of rows to join
• Used large PubMed data

BFS time (seconds):
                Edge List   Adjacency List (Denormalized)
Unidirectional  287.926     120.359
Bidirectional   204.90      56.431
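Level-synchronous BFS over a denormalized adjacency list can be sketched as follows; the toy edge list is illustrative, and in the dataflow setting the inner loop becomes a frontier-to-adjacency join:

```python
from collections import defaultdict

def bfs_levels(edges, source):
    """Level-synchronous BFS: each step 'joins' the current frontier
    against a denormalized adjacency list (one entry per vertex,
    rather than one row per edge), returning each vertex's depth."""
    adjacency = defaultdict(list)          # denormalized form
    for src, dst in edges:
        adjacency[src].append(dst)
    level, visited, frontier = {source: 0}, {source}, [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:                 # frontier x adjacency join
            for u in adjacency[v]:
                if u not in visited:
                    visited.add(u)
                    level[u] = depth
                    next_frontier.append(u)
        frontier = next_frontier
    return level

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
levels = bfs_levels(edges, 1)
```

Denormalization helps precisely because the join touches one adjacency row per frontier vertex instead of scanning many edge rows, matching the speedups in the table above.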
DAS system is ideal for handling complex subgraph pattern queries on large data sets
• Find authors who published four articles on specific dates (Query 1)
• Find authors who published two articles in the same journal (Query 2)
• Find authors who published four articles in the journal Physical Review Letters (Query 3)
DAS system is ideal for handling complex subgraph pattern queries on large data sets (cont'd)
• Find two authors who have co-authored two papers (Query 4)
• Find an article that has an associated grant, an article that does not, and their corresponding authors (Query 5)
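These pattern queries reduce to joins and group-bys over the relation tables. As one hedged example, Query 2 (authors with two articles in the same journal) can be sketched like this; the relation names and sample data are illustrative, not the actual PubMed tables:

```python
from collections import defaultdict

def authors_with_two_in_same_journal(is_author_of, published_in):
    """Query 2 sketch: join IsAuthorOf with PublishedIn, group by
    (author, journal), and keep groups with >= 2 distinct articles.
    is_author_of: (author, article) pairs; published_in: article -> journal."""
    per_author_journal = defaultdict(set)
    for author, article in is_author_of:
        journal = published_in.get(article)
        if journal is not None:
            per_author_journal[(author, journal)].add(article)
    return {(author, journal)
            for (author, journal), articles in per_author_journal.items()
            if len(articles) >= 2}

is_author_of = [("smith", "a1"), ("smith", "a2"), ("smith", "a3"),
                ("jones", "a2"), ("jones", "a4")]
published_in = {"a1": "PRL", "a2": "PRL", "a3": "Nature", "a4": "Nature"}
matches = authors_with_two_in_same_journal(is_author_of, published_in)
```

Queries 1 and 3 follow the same shape with a count of four and an extra filter on date or journal name; the four-way self-joins are where the DAS speedups in the next table come from.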
Query Performance for Large PubMed (30M nodes), in seconds:

         DAS (20 nodes)   Netezza (54 nodes)   YADM (4 nodes)
Query 1  9.422            27.3                 120.00
Query 2  142.099          834.47               930.00
Query 3  469.511          15392.96             10188.00
Query 4  37.803           741.42               667.00
Query 5  44.600           496.48               N/A

~250 to ~300X speedup
DAS system still outperforms other SQL machines in price/performance

Metric = Time / (#Spindles * Cost):

         DAS           Netezza       YADM
Query 1  2.51253E-05   6.66144E-05   0.0004
Query 2  0.000378931   0.00203618    0.0031
Query 3  0.001252029   0.037560164   0.03396
Query 4  0.000100808   0.001809129   0.002223333
Query 5  0.000118933   0.001211454   N/A
LNSSI/LLNL measured Terabyte Sort (TB Sort) performance
• One of the sort benchmarks that measures the elapsed time to sort 10^12 bytes of data
• Yahoo holds the current record (as of March 2009)
– 3.48 minutes on 910 nodes (4 dual-core processors, 4 disks, 8 GB memory per node), using Hadoop Map/Reduce
• Performed TB sort on a 20-node DAS system
• Achieved 2X speedup by
– Radix-based distribution and local sort, which makes tasks independent
– Optimized SORT operation

Apache Hadoop  03:20:44
DAS            01:39:26
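The radix-based distribution plus local sort idea can be sketched as follows; the key space and partition count are illustrative assumptions, not the benchmark's actual parameters:

```python
def radix_distributed_sort(keys, key_space=1000, num_nodes=4):
    """TB Sort-style scheme: radix-partition keys into globally
    ordered, disjoint ranges, sort each partition independently
    ('locally'), and concatenate. Because the ranges are ordered,
    no global merge or cross-node communication is needed after
    the initial distribution -- the tasks stay independent."""
    buckets = [[] for _ in range(num_nodes)]
    for k in keys:
        buckets[k * num_nodes // key_space].append(k)
    result = []
    for bucket in buckets:          # each local sort is independent
        result.extend(sorted(bucket))
    return result

data = [802, 15, 407, 933, 250, 3, 611]
ordered = radix_distributed_sort(data)
```

Contrast this with a hash distribution, which balances load but destroys global order and therefore forces a final merge phase.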
Found some key people from large Enron email graph by running the PageRank algorithm
• Data has 4022 Enron employees and 51078 emails
• Top 30 high scorers found, with some notable names
– Jeff Dasovich
– Louise Kitchen
– Tana Jones
– John Lavorato
• Took 7 seconds to run on DAS
Developed scalable algorithm for author disambiguation in many-tasks paradigm
• LLNL has developed an entity resolution algorithm based on a binning (or blocking) algorithm
• Original algorithm could not complete on the full PubMed data set
– Only 67% completed (in 2 months)
– Could not resolve bins with 300+ names
• Uses DAS as an active-disk system
– "Bring computation to where the data is, instead of moving data from the data store"
• Achieved orders of magnitude performance improvement
Distributed disambiguation algorithm: DAS as an Active Disk
• Original (sequential) algorithm: a binning algorithm (BA) pulls author info (bins, coauthors, keywords, abstracts, titles) from a MySQL RDBMS over JDBC; the RDBMS is the performance bottleneck
• Many-tasks disambiguation algorithm: reader, binner, and resolver stages run on DAS; the binning algorithm works only on local data, executed by many tasks in parallel, producing the disambiguated authors
• Able to process 45,529,883 bins in 20 hours!
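The binning (blocking) structure described above can be sketched as follows. This is a toy illustration only: the bin key (last name plus first initial) and the coauthor-overlap resolver are assumptions standing in for LLNL's actual algorithm, which also uses keywords, abstracts, and titles:

```python
from collections import defaultdict

def bin_authors(author_records):
    """Blocking step: group author mentions into bins by a cheap key
    (here, last name + first initial) so the expensive pairwise
    resolution runs only inside each bin, and bins run in parallel."""
    bins = defaultdict(list)
    for rec in author_records:
        last, first = rec["name"].split(",", 1)
        key = (last.strip().lower(), first.strip()[:1].lower())
        bins[key].append(rec)
    return bins

def resolve_bin(records):
    """Toy resolver: mentions sharing a coauthor are the same person."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if any(set(rec["coauthors"]) & set(r["coauthors"])
                   for r in cluster):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

mentions = [
    {"name": "Yoo, A.", "coauthors": ["Kaplan"]},
    {"name": "Yoo, Andy", "coauthors": ["Kaplan", "Smith"]},
    {"name": "Yoo, B.", "coauthors": ["Lee"]},
]
bins = bin_authors(mentions)
```

Because resolution touches only records within a bin, each bin is an independent task, which is what lets the DAS run the binner and resolver stages against local data with no central database in the loop.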
Many-tasks model has enabled efficient large graph analysis
• High performance and scalability feasible with the many-tasks approach
• Benefits
– Enables data parallelism on large-scale data
– Reduces communication via independent, localized tasks
– Enables optimization of tasks for built-in constructs
– Combines complexity and flexibility
Conclusions
• Studied the use of the many-tasks model for large complex graph analysis
• Evaluated the performance of a comprehensive set of graph applications, including subgraph pattern queries, on an actual dataflow system
• The many-tasks paradigm is a very promising approach for graph mining applications and offers many advantages over contemporary methods like RDBMS and Map/Reduce