Graph Database with Cassandra
TRANSCRIPT
Proprietary and Confidential / © The Nerdery, LLC 2
Graph Databases
Brandon Veber
Chad Dvoracek
Agenda
• Introduction to graph databases
  – What they are
  – Why to use them
• Titan technology stack
  – NoSQL distributed scalable data storage
  – Spark in-memory distributed computing
• Graph queries and analytics
Introduction to Graph
What is a Graph Database?
Graph databases use graph structures such as nodes and edges to store data and relationships.
Entities are modelled as nodes and the relationships between them are modelled as edges.
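As a minimal sketch of this model (not tied to any particular graph database), nodes and edges can be represented as plain objects that each carry a label and properties. The names Node, Edge, connect, and the "Alice knows Bob" data are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity: a label plus a dictionary of properties."""
    label: str
    properties: dict
    out_edges: list = field(default_factory=list)  # edges leaving this node


@dataclass
class Edge:
    """A relationship between two nodes; it can carry properties too."""
    label: str
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)


def connect(source, label, target, **properties):
    """Create an edge and register it on the source node."""
    edge = Edge(label, source, target, properties)
    source.out_edges.append(edge)
    return edge


# Hypothetical data: two people and one relationship between them.
alice = Node("person", {"name": "Alice"})
bob = Node("person", {"name": "Bob"})
connect(alice, "knows", bob, since=2015)

# Follow the 'knows' edges out of Alice.
friends = [e.target.properties["name"] for e in alice.out_edges if e.label == "knows"]
print(friends)  # ['Bob']
```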
Blue, J. Driving Insights with Network Graphs. Retrieved from www.mapr.com/blog/driving-insights-network-graphs
How is it different from RDBMS?
● Relational databases prioritize the table
● Relationships are ad-hoc in the form of FK constraints
● Querying through complex relationships requires several costly joins
Graph DB vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/
How is it different from RDBMS?
● Nodes contain entities and their corresponding properties
● Relationships are given top priority
● Pointers instead of index look-ups
Graph Database. Accessed from https://en.wikipedia.org/wiki/Graph_database
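The contrast between join-style look-ups and pointer following can be sketched in a few lines. This is an illustration only: the people and friendships tables and the Person class are hypothetical, and real engines use indexes and storage layouts far more sophisticated than a Python list.

```python
# Relational style: relationships live in a separate table of foreign keys.
people = {1: "Alice", 2: "Bob", 3: "Carol"}
friendships = [(1, 2), (2, 3)]  # (person_id, friend_id) rows


def friends_relational(person_id):
    # Each hop searches the join table, then looks names up by key.
    return [people[f] for (p, f) in friendships if p == person_id]


# Graph style: each node holds direct references to its neighbours.
class Person:
    def __init__(self, name):
        self.name = name
        self.friends = []


alice, bob, carol = Person("Alice"), Person("Bob"), Person("Carol")
alice.friends.append(bob)
bob.friends.append(carol)


def friends_of_friends(person):
    # Two hops, following pointers only; no table search per hop.
    return [fof.name for f in person.friends for fof in f.friends]


print(friends_relational(1))      # ['Bob']
print(friends_of_friends(alice))  # ['Carol']
```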
How is it different from RDBMS?
● Inherently NoSQL
● Scalable
● High availability
● Data model is intuitive and agile.
Graph vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/
When to use Graph DB
When not to use Graph DB
● Data warehousing
● Schema-oriented design
● Aggregates on sets
● Robust transactional processing
When to use Graph DB
● Graph databases work well with highly interconnected data with complex relationships
● Some use cases include:
  ○ Social networks
  ○ Route planning
  ○ Master data management
  ○ Recommendation engines
AWS Master Data Management Model. Accessed from http://neo4j.com/graphgist/8526106
Successful Use Cases
Successful Use Cases - HealthUnlocked
Goal: Redesign system to manage performance issues associated with increasing data volume
Methods:
● Graph database to store relationships between symptoms, conditions and treatments
● Language processing to build multilingual ontology into the database
Result:
● Improved query performance
● Easier data model for pattern matching
● Two months to launch
Open Source Graph Framework
● The Apache TinkerPop project provides an open source, vendor-agnostic framework for graph construction, query and analysis.
● Changing between graph engines and back-end storage technologies is possible without significant refactoring
● Supports graph databases (OLTP) and graph analytics (OLAP)
Titan
Technology Stack - Storage
● Supports several distributed NoSQL databases
● Support for ACID transactions
● Linearly scalable
Technology Stack - Analytics & ETL
Titan offers support for several analytics and batch loading technologies.
Technology Stack - Search + Framework
Titan supports the following search technologies:
• Elasticsearch
• Lucene
• Solr
Titan also integrates natively with Apache TinkerPop.
Apache Cassandra
● Partitioned wide-column (key-value style) store
● Exceptional fault tolerance
● Scalable
● Denormalized tables
Apache Spark
● Resilient distributed datasets
● In-memory cluster computing
● Scalable
● Up to 100x faster than Hadoop MapReduce for in-memory workloads (project-reported figure)
● Native Cassandra connector
DataStax Graph
● Designed for cloud applications
● Multi-model capable
● Enterprise support
● Scalable
Queries and Analytics
Example Model
Example Model
● Edges can contain values and properties as well
● The ‘Includes’ edge will contain a quantity property
Simple Traversal
Traversal Example
Question: What items were purchased in ‘Transaction 1’?
g.V().hasLabel('transaction').has('tx_id',1).out('includes').values('name')
Output
● Pop
● Gum
● Bread
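To make the steps of this traversal concrete, here is a rough plain-Python analogue over a tiny in-memory graph. It is a sketch only: the vertex ids, the data layout, and the out_step helper are hypothetical, and the data is made up to reproduce the output listed above.

```python
# Hypothetical toy data: one transaction that includes three items.
vertices = {
    "t1": {"label": "transaction", "tx_id": 1},
    "i1": {"label": "item", "name": "Pop"},
    "i2": {"label": "item", "name": "Gum"},
    "i3": {"label": "item", "name": "Bread"},
}
edges = [  # (source, edge label, target)
    ("t1", "includes", "i1"),
    ("t1", "includes", "i2"),
    ("t1", "includes", "i3"),
]


def out_step(vertex_ids, edge_label):
    """Follow outgoing edges with the given label, like Gremlin's out()."""
    return [dst for (src, lbl, dst) in edges if lbl == edge_label and src in vertex_ids]


# hasLabel('transaction').has('tx_id', 1)
start = [vid for vid, props in vertices.items()
         if props["label"] == "transaction" and props.get("tx_id") == 1]

# .out('includes').values('name')
names = [vertices[v]["name"] for v in out_step(start, "includes")]
print(names)  # ['Pop', 'Gum', 'Bread']
```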
Traversal Example
Question: What customers have shopped at ‘Store 1’?
g.V().hasLabel('store').has('store_id',1).out('processes').in('purchases').values('name')
Output
● Customer 1
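This traversal mixes directions: it walks out of the store along 'processes' edges, then walks backwards along 'purchases' edges into the customers. A plain-Python sketch of that idea follows; the edge directions are inferred from the query, and the ids, data, and helper names are hypothetical.

```python
# Hypothetical toy data matching the output above.
vertices = {
    "s1": {"label": "store", "store_id": 1},
    "t1": {"label": "transaction", "tx_id": 1},
    "c1": {"label": "customer", "name": "Customer 1"},
}
edges = [  # (source, edge label, target)
    ("s1", "processes", "t1"),  # store -> transaction
    ("c1", "purchases", "t1"),  # customer -> transaction
]


def out_step(vertex_ids, edge_label):
    """Follow edges from source to target, like Gremlin's out()."""
    return [dst for (src, lbl, dst) in edges if lbl == edge_label and src in vertex_ids]


def in_step(vertex_ids, edge_label):
    """Follow edges from target back to source, like Gremlin's in()."""
    return [src for (src, lbl, dst) in edges if lbl == edge_label and dst in vertex_ids]


stores = [vid for vid, props in vertices.items()
          if props["label"] == "store" and props.get("store_id") == 1]
transactions = out_step(stores, "processes")
customers = [vertices[v]["name"] for v in in_step(transactions, "purchases")]
print(customers)  # ['Customer 1']
```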
Branching Traversal
Question: Of all transactions when ‘Pop’ was purchased what was the average quantity?
g.V().has('name','Pop').inE('includes').values('quantity').mean()
Output
● 1.5
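The key point in this query is that quantity is read from the incoming 'includes' edges, not from the item node. A small Python sketch of the same calculation follows, assuming hypothetical edge data (two transactions containing 1 and 2 units of Pop) chosen to match the output above.

```python
from statistics import mean

# Hypothetical 'includes' edges: (transaction, item name, quantity on the edge).
includes_edges = [
    ("tx1", "Pop", 1),
    ("tx2", "Pop", 2),
    ("tx1", "Gum", 2),
]

# inE('includes') for the 'Pop' item, then values('quantity').
pop_quantities = [qty for (_tx, item, qty) in includes_edges if item == "Pop"]

# mean()
average_pop_quantity = mean(pop_quantities)
print(average_pop_quantity)  # 1.5
```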
Branching Traversal
Question: For each item, what is the average quantity purchased per transaction?
g.V().hasLabel('item').local(inE('includes').values('quantity').mean())
Output
● 1.5
● 2.5
● 1
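The local() step computes the mean separately for each item rather than once across all edges. In plain Python this amounts to grouping the edge quantities by item before averaging. The edge data below is hypothetical and chosen to reproduce the three values above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical 'includes' edges: (transaction, item name, quantity on the edge).
includes_edges = [
    ("tx1", "Pop", 1),
    ("tx2", "Pop", 2),
    ("tx1", "Gum", 2),
    ("tx2", "Gum", 3),
    ("tx1", "Bread", 1),
]

# Group quantities per item: the per-vertex scope that local() provides.
quantities_by_item = defaultdict(list)
for _tx, item, qty in includes_edges:
    quantities_by_item[item].append(qty)

# One mean per item instead of one global mean.
average_by_item = {item: mean(qtys) for item, qtys in quantities_by_item.items()}
print(average_by_item)  # {'Pop': 1.5, 'Gum': 2.5, 'Bread': 1}
```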
More Traversal Strategies
● Recursive
● Path
● Projecting
● Declarative
Graph Analytics - Network Properties
● Node count - Total number of nodes
● Edge count - Total number of edges
● Diameter - Maximum length of a shortest path between any two nodes
● Min, max and mean degree - Degree is the number of connections for each node
● Degree distribution - Histogram (shown on the next slide)
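These properties can be computed directly on a small example. The sketch below assumes a hypothetical four-node, undirected, connected graph stored as an adjacency dictionary, and finds the diameter with a breadth-first search from every node. On large graphs this work would normally be handed to a distributed engine.

```python
from collections import deque

# Hypothetical undirected graph: node -> set of neighbours.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

node_count = len(adjacency)
# Every undirected edge appears twice in the adjacency sets.
edge_count = sum(len(neighbours) for neighbours in adjacency.values()) // 2

degrees = {node: len(neighbours) for node, neighbours in adjacency.items()}
min_degree = min(degrees.values())
max_degree = max(degrees.values())
mean_degree = sum(degrees.values()) / node_count


def shortest_path_lengths(start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist


# Diameter: the longest of all shortest paths (assumes a connected graph).
diameter = max(max(shortest_path_lengths(n).values()) for n in adjacency)

print(node_count, edge_count, min_degree, max_degree, mean_degree, diameter)
# 4 4 1 3 2.0 2
```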
Graph Analytics - Degree Distribution
The degree of a node represents how many connections it has. A degree distribution is the probability distribution of those degrees in the network.
Most real-world graphs exhibit a degree distribution similar to the GitHub one shown on the right.
BIG GRAPH DATA ON HORTONWORKS DATA PLATFORM, Accessed at http://hortonworks.com/blog/big-graph-data-on-hortonworks-data-platform/
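As a small worked example of a degree distribution, the sketch below counts how many nodes have each degree and divides by the number of nodes. The four-node graph is hypothetical and far too small to show the shape seen in real networks; it only illustrates the calculation.

```python
from collections import Counter

# Hypothetical undirected graph: node -> set of neighbours.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

# Degree of each node = number of neighbours.
degrees = [len(neighbours) for neighbours in adjacency.values()]

# Histogram: how many nodes have each degree.
degree_counts = Counter(degrees)

# Probability distribution: fraction of nodes with each degree.
degree_distribution = {
    degree: count / len(adjacency)
    for degree, count in sorted(degree_counts.items())
}
print(degree_distribution)  # {1: 0.25, 2: 0.5, 3: 0.25}
```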
Graph Analytics - Network Properties
● Clustering coefficients - Measure how densely a node's neighbors are connected to one another (how strongly nodes cluster together)
● Centrality - Identify the most important nodes (e.g. PageRank)
● Community detection - Identify groups of nodes that are more densely connected among themselves than with the other nodes in the graph
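As one concrete example, the local clustering coefficient of a node is the fraction of its neighbour pairs that are themselves connected. The sketch below computes it on a hypothetical four-node undirected graph. Defining the coefficient as 0.0 for nodes with fewer than two neighbours is an assumption made here, since conventions vary.

```python
from itertools import combinations

# Hypothetical undirected graph: node -> set of neighbours.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def local_clustering(node):
    """Fraction of pairs of neighbours that are directly connected."""
    neighbours = sorted(adjacency[node])
    pairs = list(combinations(neighbours, 2))
    if not pairs:  # fewer than two neighbours: defined as 0.0 here
        return 0.0
    linked = sum(1 for u, v in pairs if v in adjacency[u])
    return linked / len(pairs)


clustering = {node: local_clustering(node) for node in adjacency}

# A and B sit in a triangle, so all of their neighbour pairs are connected.
# C has three neighbour pairs, of which only (A, B) is connected.
print(clustering["A"], clustering["B"], clustering["D"])  # 1.0 1.0 0.0
```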
Questions?