big, fast graph analysis and data management for hadoop...nosql/ hdfs clientiniatepgxasa yarntask...
TRANSCRIPT
Big, Fast Graph Analysis and Data Management for Hadoop
Jay Banerjee, PhD, Senior Director, Oracle Spa@al & Graph Hassan Chafi, PhD, Senior Research Manager, Oracle Labs Zhe Wu, PhD, Architect, Oracle Spa@al and Graph
October 2, 2014
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Safe Harbor Statement
The following is intended to outline our general product direc@on. It is intended for informa@on purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or func@onality, and should not be relied upon in making purchasing decisions. The development, release, and @ming of any features or func@onality described for Oracle’s products remains at the sole discre@on of Oracle.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Program Agenda
Introduc)on to Graphs
Data Management for Property Graphs in Hadoop
PGX – Parallel In-‐Memory Graph Analy)cs
1
2
3
Current Direc)on for Graph Research
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Program Agenda
Introduc)on to Graphs
Data Management for Property Graphs in Hadoop
PGX – Parallel In-‐Memory Graph Analy)cs
1
2
3
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Introduc@on to Graphs
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
This en@re session covers the current research and development of a property graph prototype
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Rela@onal Tables vs Rela@onship Graphs
anI)neraryReserved
aPassenger
aFlightSegment
aBookingAgent anAirport
aFlightSchedule
aPayment
aReferenceDate
aFlightCost
A Graph Representa)on of the Airline Reserva)on Data Model
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
When to Choose Graph Based Data Modeling • Flexible schema
– No schema or incomplete schema, may evolve over @me – Types of rela@onships may be heterogeneous
• Heavily rela@onship centric – linked data model
• Analy@cs on the en@re data set or a subset is important – Find influencers in a social network – Find clusters of similar individuals
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Major Types of Graphs in Oracle Spa@al and Graph
• Network Data Model graph – Mainly used for physical, spa@al networks
– Water and gas pipelines, electricity networks, road networks
– Telcos, u@li@es, transporta@on
• RDF Seman@c graph – Enterprise class RDF Graph Database for Linked data; Knowledge management, Intelligence applica@ons
– Scales to petabytes of triples – by exploi@ng Exadata, RAC, SQL*Loader , Parallelism, Label Security
– W3C standards: RDFS, OWL2 RL, OWL2 EL, SPARQL 1.1, RDB2RDF, RDFa, SKOS plus User-‐defined rules
– Seman@c reasoning, unifying metadata from many sources
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Another kind of Graph: Property Graph
• @nkerpop.com provides a set of commonly used APIs • Primarily used for traversing enterprise and social networks
– Simple no@on of nodes and edges with proper@es
• Property graph model differs from RDF graph model – No industry standards for data representa@on and query language – No no@on of unique URIs (Universal Resource Iden@fiers) – No rule-‐based seman@cs (OWL, user-‐defined rules) – Edges may have proper@es (RDF edge proper@es managed in named graphs)
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Property Graph Example
(from hjps://github.com/@nkerpop/gremlin/wiki/Defining-‐a-‐Property-‐Graph)
A property graph has these elements: • a set of ver@ces
each vertex has a unique iden@fier. each vertex has a set of outgoing edges. each vertex has a set of incoming edges. each vertex has a collec@on of proper@es defined by a map from key to value.
• a set of edges each edge has a unique iden@fier. each edge has an outgoing tail vertex. each edge has an incoming head vertex. each edge has a label that denotes the type of rela@onship between its two ver@ces. each edge has a collec@on of proper@es defined by a map from key to value.
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Value of Property Graph Analy@cs • Finding paths between ver@ces • Iden@fying primary influencers in a people network
• Iden@fying clusters to customize campaigns
• Predic@ng trends, customer behavior
• Managing brand reputa@on
• Sen@ment analysis
• Discovering rela@onships based on pajern matching
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Prototype Reference Architecture for Property Graph on Hadoop
Scalable and Persistent Storage Management
Graph Data Access API
Graph Analy)cs Single-‐Node in-‐Memory
Parallel In-‐memory Analy@c Engine
REST/Web Service Blueprints & Lucene
Graph Model and format
RDF (RDF/XML, N-‐Triples, N-‐Quads, TriG,N3,JSON)
PG (GraphML, GML, Graph-‐SON,
Flat Files)
Python, Perl, PHP, Ruby, Javascript, …
Java APIs
Java APIs
Property Graph formats
GraphML, GML,
Graph-‐SON, Flat Files
Other NoSQL Apache HBase
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Program Agenda
Introduc)on to Graphs
Data Management for Property Graphs in Hadoop
PGX – Parallel In-‐Memory Graph Analy)cs
1
2
3
Current Direc)on for Graph Research
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Research into Property Graph Data Management for Hadoop
Oracle Property Graph Prototype
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Property Graph Data Access Layer • Key ajributes of the data access layer
– Persistent stores for storing graph data
– Efficient Tinkerpop Blueprints API implementa@on
– Text search through Apache Lucene integra@on
– Easy-‐to-‐use Java APIs for • Fast, parallel bulk load (import) of property graph into the database • Fast, parallel read (export) of property graph from the database • Fast, parallel text indexing of property graph data (ver@ces and edges)
– REST Interface
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
• Oracle NoSQL Database – Key features used
• Tables, Parent/Child table • Secondary indices • Parallel scanner
• Apache HBase ⁻ Key features used • Tables • Parallel scans across regions • Column family
Persistent Stores Used in Data Access Layer
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
• Oracle NoSQL Database – Key features used
• Tables, Parent/Child table • Secondary indices • Parallel scanner
• Apache HBase ⁻ Key features used • Tables • Parallel scans across/within regions • Column family
Persistent Stores Used in the Data Access Layer
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Efficient Blueprints API Implementa@on • OraclePropertyGraph
– Leverage na#ve database features • Use column family/map to aggregate K/V pairs • Use parallel table scan, secondary indices • Use parallel region scan
– Maintain a small, in-‐memory ver@ces/edges cache – Batch up opera@ons performed against graph elements
Graph KeyIndexableGraph Transac@onalGraph IndexableGraph
OraclePropertyGraphBase
oracle.pg.hbase. OraclePropertyGraph
oracle.pg.nosql. OraclePropertyGraph
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Easy-‐to-‐Use Java APIs • OraclePropertyGraph
– Setup database connec@on informa@on • kconfig = new KVStoreConfig("kvstore", hhosts); • hconn = [email protected]@on(conf);
– Create OraclePropertyGraph instance • opg = OraclePropertyGraph.getInstance(…)
– Add/remove/edit graph data • opgdl = OraclePropertyGraphDataLoader.getInstance(); • opgdl.loadData(opg, vertexFile, edgeFile, 32 /* parallel degree */);
– Navigate • vertex.getEdges(direc@onOut, “follows”) vertex.outE.inV.name
– Using PGX analy@c func@ons • analyst = opg.getInMemAnalyst();
• triangles = analyst.countTriangles().get();
DB Connec@on info
Create OraclePropertyGraph
Add/Remove/Edit Graph
Navigate/Search
Perform Analy@cs
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Text Search through Apache Lucene
• Integrate with Apache Lucene • Support manual and auto indexing of Graph elements
• Manual index:
• oraclePropertyGraph.createIndex(“my_index", Vertex.class);
• indexVer@ces = oraclePropertyGraph.getIndex(“my_index” , Vertex.class);
• [email protected](“key”, “value”, myVertex);
• Auto Index • oraclePropertyGraph.createKeyIndex(“name”, Edge.class);
• oraclePropertyGraph.getEdges(“name”, “*hello*world”);
• Scale to billions of graph elements
Find ver@ces using
syntax like: "*oracle* or *graph*”
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
REST Interface • Provides an implementa@on/client agnos@c way – Access graph data – Manipulate graph data setenv pg_graph hjp://localhost:8182/graphs/oracle_user_graph curl … -‐X POST "${pg_graph}/ver@ces/101"
curl … -‐X POST "${pg_graph}/ver@ces/102“
curl … -‐X POST
"${pg_graph}/edges/200?_outV=101&
_label=like&_inV=102&weight=1.2“
curl … -‐i -‐X DELETE "${pg_graph}/edges/200"
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Property Graph Serializa@on Formats
• GML
• GraphML
• GraphSON
• Oracle-‐defined Property Graph Flat Files • Vertex file • Edge file • Allow mul@ple data types to be
associated with one key
• Support Date with Timezone and Serializable objects
Different data types associated with the same
“likes” rela@onship
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Example Usage Flow with Typical Java APIs (NoSQL Database) – Setup database connec@on informa@on
• hhosts = new String[1]; hhosts[0] = "127.0.0.1:5000"; • kconfig = new KVStoreConfig("kvstore", hhosts);
– Create OraclePropertyGraph instance • opg = OraclePropertyGraph.getInstance(…)
– Add/remove/edit graph data • opgdl = OraclePropertyGraphDataLoader.getInstance(); • opgdl.loadData(opg, vertexFile, edgeFile, 32 /* parallel degree */); • opg.addEdge(null, vertexA, vertexB, “friend”);
– Navigate • opg.getVer@ces(); • vertex.getEdges(direc@onOut, “follows”) vertex.outE.inV.name
– Using PGX analy@c func@ons • analyst = opg.getInMemAnalyst();
• triangles = analyst.countTriangles().get();
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Example Usage Flow with Typical Java APIs (on Apache HBase) – Setup database connec@on informa@on
• conf = [email protected](); • conf.set("hbase.zookeeper.quorum", “host1,host2,host3"); conf.set("hbase.zookeeper.property.clientPort","2181");
– Create OraclePropertyGraph instance • opg = OraclePropertyGraph.getInstance(…)
– Add/remove/edit graph data • opgdl = OraclePropertyGraphDataLoader.getInstance(); • opgdl.loadData(opg, vertexFile, edgeFile, 32 /* parallel degree */); • opg.addEdge(null, vertexA, vertexB, “friend”);
– Navigate • opg.getVer@ces(); • vertex.getEdges(direc@onOut, “follows”) vertex.outE.inV.name
– Using PGX analy@c func@ons • analyst = opg.getInMemAnalyst();
• triangles = analyst.countTriangles().get();
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Performance Preview • Benchmark property graph used
– Social network data (1.4 billion edges) – Data loading completed in
• 4.8 hr with 48 threads using 6-‐Node HBase cluster on BDA – No secondary indices built
• 15.8 hrs with 48 threads using 8-‐Node NoSQL EE cluster – Mul@ple secondary indices created
– Full graph scan • 1865 seconds to retrieve 1.4 Billion edges on HBase
– 3 splits per region and 48 threads • 6841 seconds to retrieve 1.4 Billion edges on NoSQL
– maxConcurrentRequests 32 used in TableIteratorOp@ons
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Performance Inves@ga@ons
0
2000
4000
6000
8000
10000
12000
14000
16000
1 6 11 16 21 26 31
Time (secs)
DOP
LiveJ graph on HBase
Load
0
100
200
300
400
500
600
700
1 6 11 16 21 26 31
Time (secs)
DOP
LiveJ graph on HBase
getEdges()
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Performance Inves@ga@ons
0
10000
20000
30000
40000
50000
60000
70000
80000
1 6 11 16 21 26 31
Time (secs)
DOP
LiveJ graph on NoSQL
Load
0
200
400
600
800
1000
1200
1400
1600
1 6 11 16 21 26 31 Time (secs)
DOP
LiveJ graph on NoSQL
getEdges()
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Program Agenda
Introduc)on to Graphs
Data Management for Property Graphs in Hadoop
PGX – Parallel In-‐Memory Graph Analy)cs
1
2
3
Current Direc)on for Graph Research
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
PGX – Parallel In-‐Memory Graph Analy)cs
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Graph Analysis • Data analysis that considers rela@onships between data en@@es
– By modelling your data as a graph
• Examples
Purchase Record
customer items
Product Recommenda@on Influencer Iden@fica@on
Communica@on Stream (e.g. tweets)
Graph Pajern Matching Community Detec@on
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
PGX: Graph Analysis Frameworks • Large graph analysis is @me-‐consuming because …
– The computa@on typically involves touching most nodes and edges in the graph
– The data-‐access pajern is, inherently, non-‐sequen@al (random) • Commodity HW and Rela@onal SW is op@mized for sequen@al access
• PGX is an in-‐memory, parallel framework for fast graph analy@cs
• PGX exploits the architecture of modern servers – The computa@on is parallelized using mul@ple CPU cores
– The non-‐sequen@al data-‐access is mi@gated with large DRAMs
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
• Consequently, PGX (single machine) is much faster than exis@ng – Distributed execu@on or – Out-‐of-‐core execu@on
PGX Performance (In-‐memory)
0.01 0.1 1
10 100
LiveJ Twijer Run)
me in Secon
ds
PageRank
PGX (SPARC) PGX (X86) GraphLab (X86 x 16) SQL (X86)
1
10
100
1000
10000
100000
LiveJ Web-‐UK Run)
me in Secon
ds
Triangle Coun)ng
Execu@on @me of two popular graph analysis algorithms (log scale, lower is bejer)
• PGX (In-‐memory) : x86 and SPARC (T5) • GraphLab (state-‐of-‐art distributed framework) • SQL: disk-‐based
3x – 10x faster than 16-‐machine distributed execu@on
Two orders-‐of-‐magnitude faster than disk-‐based execu@on
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Oracle Property Graph Protoype and PGX • PGX can load up a graph (through data access layer) from: – Property Graph or RDF Seman@c Graph from Oracle Database
– Stored in Oracle Database, Apache HBase or Oracle NoSQL Database
• PGX and the Data Access Layer – PGX handles analy@c workloads while the data access layer handles transac@onal workloads
• PGX supports automa@c refresh from an Oracle Database via delta-‐update
Oracle Property Graph or RDF (RDB, Hbase or NoSQL)
PGX
Analy@c Request Analy@c Request
Analy@c Request Analy@c Request Analy@c Request Analy@c Request
Trans-‐ac@onal Request
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Snapshot Consistency • PGX in-‐memory graph provides a consistent snapshot to a client session
• DB changes are tracked by PGX; new instances are created quickly for new clients • Exis@ng clients can also explicitly refresh the graph instance • It becomes trivial to support many clients, consistently, by adding more servers
Oracle property graph prototype
Graph Instance
client
Delta
PGX …
Graph Instance ‘
New client
refresh
PGX
Graph Instance
Graph Instance ‘
Client Client Client
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
• PGX provides a rich set of built-‐in (parallel) graph algorithms
• as well as parallel graph muta@on opera@ons
Built-‐in Algorithms and Graph Muta@on
Detec)ng Components and Communi)es
Tarjan’s, Kosaraju’s, Weakly Connected Components, Label Propaga@on (w/ variants), Soman and Narang’s Spacifica@on
Ranking and Walking
Pagerank, Personalized Pagerank, Betwenness Centrality (w/ variants), Closeness Centrality, Degree Centrality, Eigenvector Centrality, HITS, Random walking and sampling (w/ variants)
Evalua)ng Community Structures
Conductance, Modularity Clustering Coefficient (Triangle Coun@ng) Adamic-‐Adar
Path-‐Finding
Hop-‐Distance (BFS) Dijkstra’s, Bi-‐direc@onal Dijkstra’s Bellman-‐Ford’s
Link Predic)on SALSA (Twijer’s Who-‐to-‐follow)
Other Classics Vertex Cover Minimum Spanning-‐Tree(Prim’s)
a
d
b e
g
c i
f
h
The original graph a
d
b e
g
c i
f
h
Create Undirected Graph
Simplify Graph
a
d
b e
g
c i
f
h
Le� Set: “a,b,e” a d
b
e
g
c
i
Create Bipar)te Graph
g e b d i a f c h
Sort-‐By-‐Degree (Renumbering)
Filtered Subgraph
d
b g
i
e
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
• PGX allows users to implement their own graph algorithms with an intui@ve DSL – Our compiler transforms the given program for PGX execu@on
– While applying op@miza@ons and paralleliza@on
• DSL gives you performance benefits over using low-‐level APIs – No repeated interac@on between client and sever
– Direct access to internal data representa@on – Automated paralleliza@on from the compiler
Domain-‐Specific Language (DSL) for Custom Algorithms
foreach(t: G.nodes) { val = 0.85 + d * sum(w:t.inNbrs){w.PR/w.degree()}; ... }
DSL Program
Compiler PGX
Graph Sever
Graph Analy@c Client
Low-‐level API
getNode()
getEdge().getProperty()
.getNbr().
…
The compiler parallelizes graph algorithm if possible
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Execu@on from Hadoop Environments • PGX is integrated with Hadoop-‐based clusters
– Via YARN
• Execu)on modes:
PGX
RDF / HBASE / NoSQL / HDFS
Client ini@ate PGX as a YARN task
Interac@ve Execu@on
pgx> :loadGraph mygraph.json …
Hadoop Cluster
Client
Client controls PGX via an interac@ve shell
... pgx> :pagerank mygraph 0.85 …
To load the Graph and run the analysis
Shared Server Execu@on
PGX Sever
RDF / HBASE / NoSQL / HDFS
PGX can be configured to run as a service, with certain graphs pre-‐loaded And shared by mul@ple
clients
Batch Mode
:loadGraph … … :pagerank …
Client can submit a PGX script as a batch job
PGX
Dry Run (Local Execu@on)
… Client can run PGX locally with small data set Data File
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Distributed Graph Analysis • But what about graphs that do not fit in a single machine?
Answer: PGX.DIST, a distributed graph processing framework – The graph is spread across mul@ple machines in a cluster
– The computa@on is distributed
– The communica@on is op@mized for bandwidth
• Early prototype available for customer POCs
Interconnection Network
……
Computa@on Machines
In-memory Graph (distributed)
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
PGX Distributed Performance • Our prototype is already faster and more scalable than the state-‐of-‐the-‐art graph framework
0
5
10
15
20
25
PGX.DIST GraphLab PGX.DIST GraphLab PGX.DIST GraphLab
Pagerank SSSP WCC
Rela)ve Perform
ance
(Highe
r is Be^
er)
Xeon E5-‐2660 (2.20 GHz, 2 socket x 8 core x 2 threads), 2 -‐-‐ 16 machine, connected by Mellanox Connect-‐IB Graph: Twijer 2010 (1.4 billion edges)
2 4 8 16 2 4 8 16 2 4 8 16 2 4 8 16 2 4 8 16 2 4 8 16
PGX.DIST performs much bejer when using same # of machines
It also scales bejer with increasing # of machines
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Compiler Support and Integra@on (on-‐going) • We are adding the compiler support for DSL
– Goal: the same DSL program is compiled into PGX.DIST
• As a result, the user can easily migrate execu@on environments – Depending on the size of their data
Compiler
DSL Program
Laptop
Cluster (distributed)
Server (in-‐memory)
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Summary • Property graph prototype reference architecture on Hadoop • Persistent and scalable property graph data management
• Parallel Graph Analy@cs (PGX)
• Prototype provides a solid founda)on for large scale graph applica@ons by combining – Fast, parallel, extensible in-‐memory graph analy@cs – Fast, scalable, persistent graph data management
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
If you like what you heard, sign up for beta tes)ng
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Appendix
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
• Flexible Key-‐Value Data Model
• ACID transac@ons • Horizontally Scalable • Highly Available • Elas@c Configura@on • Simple administra@on
• Intelligent Driver • Commercial grade so�ware and support
Features
Oracle NoSQL Database Enterprise Edi@on Scalable, Highly Available, Key-‐Value Database
Applica@on
Storage Nodes Datacenter B
Storage Nodes Datacenter A
Applica@on
NoSQL DB Driver
Applica@on
NoSQL DB Driver
Applica@on
Java SE 6 (JDK 1.6.0 u25)+; Solaris or Linux
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Comparing Graph Databases • Modeling capabili@es
• APIs and analy@cal func@ons
• Performance and scalability • Enterprise features • Interoperability with other data types • Tools integra@on
Copyright © 2014, Oracle and/or its affiliates. All rights reserved. |
Scalable and Persistent Database
Graph Data Access API
REST/Web Service
Java APIs
NoSQL Database Apache HBase
Parallel Graph Analy)cs (PGX)