Graph Database with Cassandra

Proprietary and Confidential / © The Nerdery, LLC

Graph Databases

Brandon Veber

Chad Dvoracek


Agenda

• Introduction to graph databases
  – What they are
  – Why to use them

• Titan technology stack
  – NoSQL distributed scalable data storage
  – Spark in-memory distributed computing

•Graph queries and analytics


Introduction to Graph


What is a Graph Database?

Graph databases use graph structures such as nodes and edges to store data and relationships.

Entities are modelled as nodes and the relationships between them are modelled as edges.

Blue, J. Driving Insights with Network Graphs. Retrieved from www.mapr.com/blog/driving-insights-network-graphs


How is it different from RDBMS?

● Relational databases prioritize the table

● Relationships are ad hoc, expressed as foreign-key (FK) constraints

● Querying through complex relationships requires several costly joins

Graph DB vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/



● Nodes contain entities and their corresponding properties

● Relationships are given top priority

● Traversals follow direct pointers between adjacent nodes instead of index look-ups

Graph Database. Accessed from https://en.wikipedia.org/wiki/Graph_database
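
The difference between look-ups and pointer-following can be sketched in plain Python. This is only an illustration of the idea, not how any particular engine stores data; the `friend_rows` data set and all function names are hypothetical.

```python
# Relational style: relationships live in a separate table of
# (source_id, target_id) rows that must be searched for every hop.
friend_rows = [(1, 2), (1, 3), (2, 3), (3, 4)]  # hypothetical data


def friends_via_table(person_id):
    # Scans the whole table; a real RDBMS would use an index look-up instead.
    return [dst for src, dst in friend_rows if src == person_id]


# Graph style: each node holds direct references to its neighbors,
# so one hop is a pointer dereference rather than a search.
class Node:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to adjacent Node objects


nodes = {i: Node(f"person-{i}") for i in range(1, 5)}
for src, dst in friend_rows:
    nodes[src].out.append(nodes[dst])


def friends_of_friends(node):
    # Two hops: follow pointers twice, with no table scan or join.
    return sorted({fof.name for friend in node.out for fof in friend.out})


print(friends_via_table(1))          # [2, 3]
print(friends_of_friends(nodes[1]))  # ['person-3', 'person-4']
```

In the table version every extra hop is another search or join; in the node version the cost of a hop depends only on the number of neighbors of the current node.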



● Inherently NoSQL

● Scalable

● High availability

● Data model is intuitive and agile.

Graph vs RDBMS. Accessed from http://neo4j.com/developer/graph-db-vs-rdbms/


When to use Graph DB


When not to use Graph DB

● Data warehousing

● Schema-oriented design

● Aggregates on sets

● Robust transactional processing


When to use Graph DB

● Graph databases work well with highly interconnected data with complex relationships

● Some use cases include:

○ Social networks

○ Route planning

○ Master data management

○ Recommendation engines

AWS Master Data Management Model. Accessed from http://neo4j.com/graphgist/8526106


Successful Use Cases


Successful Use Cases - HealthUnlocked

Goal: Redesign system to manage performance issues associated with increasing data volume

Methods:

● Graph database to store relationships between symptoms, conditions and treatments

● Language processing to build multilingual ontology into the database

Result:

● Improved query performance

● Easier data model for pattern matching

● Two months to launch


Open Source Graph Framework

● The Apache TinkerPop project provides an open source, vendor-agnostic framework for graph construction, query and analysis.

● Changing between graph engines and back-end storage technologies is possible without significant refactoring

● Supports graph databases (OLTP) and graph analytics (OLAP)


Titan


Technology Stack - Storage

● Supports several distributed NoSQL databases

● Support for ACID transactions or eventual consistency, depending on the storage backend

● Linearly scalable


Technology Stack - Analytics & ETL

Titan offers support for several analytics and batch loading technologies.


Technology Stack - Search + Framework

Titan supports the following search technologies:

• Elasticsearch

• Lucene

• Solr

Titan also integrates natively with Apache TinkerPop.


Apache Cassandra

● Key-value store (partitioned wide-column data model)

● Exceptional fault tolerance

● Scalable

● Denormalized tables


Apache Spark

● Resilient distributed datasets

● In-memory cluster computing

● Scalable

● Up to 100x faster than Hadoop MapReduce for in-memory workloads

● Native Cassandra connector


DataStax Graph

● Designed for cloud applications

● Multi-model capable

● Enterprise support

● Scalable


Queries and Analytics


Example Model

● Edges can contain values and properties as well

● The ‘Includes’ edge will contain a quantity property
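
A property graph in which edges carry their own data can be sketched with two small Python classes. The labels `transaction`, `item` and `includes` and the property names `tx_id`, `name` and `quantity` follow the queries in this deck; the class names and the quantity value are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: Vertex  # tail: the vertex the edge starts from
    in_v: Vertex   # head: the vertex the edge points to
    properties: dict = field(default_factory=dict)


tx1 = Vertex("transaction", {"tx_id": 1})
pop = Vertex("item", {"name": "Pop"})

# The relationship itself carries data: how many units were bought.
includes = Edge("includes", tx1, pop, {"quantity": 1})  # illustrative value

print(includes.label, includes.properties["quantity"])  # includes 1
```

Storing the quantity on the edge keeps it tied to one specific transaction-item pair, which neither vertex could express on its own.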


Simple Traversal


Traversal Example

Question: What items were purchased in 'Transaction 1'?

g.V().hasLabel('transaction').has('tx_id', 1).out('includes').values('name')

Output

● Pop

● Gum

● Bread
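
Step by step, the traversal above filters the starting vertices, follows the outgoing edges, and reads a property. A rough plain-Python equivalent (not Gremlin, and not how TinkerPop executes it) is sketched below; the vertex ids and the edge list are hypothetical, chosen to reproduce the output on the slide.

```python
# Vertices keyed by a hypothetical id; labels and property names follow the query.
vertices = {
    "t1": {"label": "transaction", "tx_id": 1},
    "t2": {"label": "transaction", "tx_id": 2},
    "pop": {"label": "item", "name": "Pop"},
    "gum": {"label": "item", "name": "Gum"},
    "bread": {"label": "item", "name": "Bread"},
}

# Edges as (out-vertex, edge label, in-vertex) triples.
edges = [
    ("t1", "includes", "pop"),
    ("t1", "includes", "gum"),
    ("t1", "includes", "bread"),
    ("t2", "includes", "pop"),
]


def out(vertex_ids, edge_label):
    """Follow outgoing edges with the given label, like Gremlin's out()."""
    return [dst for src, lbl, dst in edges
            if src in vertex_ids and lbl == edge_label]


# g.V().hasLabel('transaction').has('tx_id', 1)
start = [vid for vid, v in vertices.items()
         if v["label"] == "transaction" and v.get("tx_id") == 1]

# .out('includes').values('name')
names = [vertices[vid]["name"] for vid in out(start, "includes")]
print(names)  # ['Pop', 'Gum', 'Bread']
```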


Traversal Example

Question: What customers have shopped at 'Store 1'?

g.V().hasLabel('store').has('store_id', 1).out('processes').in('purchases').values('name')

Output

● Customer 1
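
This query changes direction mid-traversal: it leaves the store along outgoing `processes` edges, then arrives at customers by walking `purchases` edges backwards. A plain-Python sketch of that two-hop walk follows; the ids and edges are hypothetical, and the edge directions (store to transaction, customer to transaction) are inferred from the query.

```python
# Edges as (out-vertex, edge label, in-vertex) triples; hypothetical data.
edges = [
    ("store1", "processes", "t1"),
    ("store2", "processes", "t2"),
    ("cust1", "purchases", "t1"),
    ("cust2", "purchases", "t2"),
]
customer_names = {"cust1": "Customer 1", "cust2": "Customer 2"}


def out(vertex_ids, edge_label):
    """Follow edges forwards, from out-vertex to in-vertex."""
    return [dst for src, lbl, dst in edges
            if src in vertex_ids and lbl == edge_label]


def in_(vertex_ids, edge_label):
    """Follow edges backwards, from in-vertex to out-vertex."""
    return [src for src, lbl, dst in edges
            if dst in vertex_ids and lbl == edge_label]


transactions = out(["store1"], "processes")  # store -> transactions
customers = in_(transactions, "purchases")   # transactions -> customers
print([customer_names[c] for c in customers])  # ['Customer 1']
```

Gremlin does not remove duplicates on its own, so a customer with several transactions at the same store would be listed once per transaction unless a `dedup()` step is added before `values('name')`.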


Branching Traversal

Question: Across all transactions in which 'Pop' was purchased, what was the average quantity?

g.V().has('name','Pop').inE('includes').values('quantity').mean()

Output

● 1.5
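
The query reads the `quantity` property from every `includes` edge pointing at 'Pop' and averages the values. A minimal sketch in plain Python, with hypothetical quantities chosen so that the result matches the slide:

```python
from statistics import mean

# (transaction id, item name, quantity) triples standing in for
# 'includes' edges; the quantities are illustrative, not real data.
includes_edges = [
    ("t1", "Pop", 1),
    ("t1", "Gum", 2),
    ("t1", "Bread", 1),
    ("t2", "Pop", 2),
    ("t2", "Gum", 3),
]

# has('name', 'Pop').inE('includes').values('quantity')
pop_quantities = [qty for _, item, qty in includes_edges if item == "Pop"]

# .mean()
print(mean(pop_quantities))  # 1.5
```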


Branching Traversal

Question: For each item, what is the average quantity per transaction in which it was purchased?

g.V().hasLabel('item').local(inE('includes').values('quantity').mean())

Output

● 1.5

● 2.5

● 1
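
The `local()` step is what turns one global average into one average per item: the inner traversal runs separately for each item vertex. The plain-Python sketch below contrasts the two scopes, using hypothetical quantities chosen to match the output above.

```python
from collections import defaultdict
from statistics import mean

# (transaction id, item name, quantity) triples standing in for
# 'includes' edges; the quantities are illustrative, not real data.
includes_edges = [
    ("t1", "Pop", 1),
    ("t1", "Gum", 2),
    ("t1", "Bread", 1),
    ("t2", "Pop", 2),
    ("t2", "Gum", 3),
]

# Without local(): mean() reduces over every edge of every item.
global_mean = mean(qty for _, _, qty in includes_edges)

# With local(): the edges are grouped per item before averaging.
by_item = defaultdict(list)
for _, item, qty in includes_edges:
    by_item[item].append(qty)
per_item_mean = {item: mean(qtys) for item, qtys in by_item.items()}

print(global_mean)    # 1.8
print(per_item_mean)  # {'Pop': 1.5, 'Gum': 2.5, 'Bread': 1}
```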


More Traversal Strategies

● Recursive

● Path

● Projecting

● Declarative


Graph Analytics - Network Properties

● Node count - Total number of nodes

● Edge count - Total number of edges

● Diameter - Maximum length of a shortest path between any two nodes

● Min & Max & Mean Degree - Degree is the number of connections for each node

● Degree distribution - Histogram (shown on next page)
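
These properties can be computed directly from an adjacency map. The sketch below uses a small hypothetical undirected graph and breadth-first search for shortest paths; it assumes the graph is connected and is meant for illustration, not for graphs of production size.

```python
from collections import deque

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

node_count = len(graph)
# Each undirected edge appears in two neighbor sets, so halve the total.
edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2
degrees = {node: len(nbrs) for node, nbrs in graph.items()}
mean_degree = sum(degrees.values()) / node_count


def shortest_path_lengths(source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


# Diameter: the longest of all shortest paths.
diameter = max(max(shortest_path_lengths(n).values()) for n in graph)

print(node_count, edge_count)  # 5 5
print(min(degrees.values()), max(degrees.values()), mean_degree)  # 1 3 2.0
print(diameter)  # 3
```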


Graph Analytics - Degree Distribution

The degree of a node represents how many connections it has. A degree distribution is the probability distribution of those degrees in the network.

Most graphs exhibit behavior similar to the GitHub degree distribution shown on the right.

Big Graph Data on Hortonworks Data Platform. Accessed from http://hortonworks.com/blog/big-graph-data-on-hortonworks-data-platform/
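
As a minimal sketch, a degree distribution can be built by counting how many nodes share each degree and dividing by the node count. The graph below is hypothetical and far too small to show the shape seen in real networks; it only demonstrates the calculation.

```python
from collections import Counter

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}

# Histogram: degree -> number of nodes with that degree.
histogram = Counter(len(nbrs) for nbrs in graph.values())

# Probability distribution: degree -> fraction of nodes with that degree.
distribution = {degree: count / len(graph)
                for degree, count in sorted(histogram.items())}

print(distribution)  # {1: 0.2, 2: 0.6, 3: 0.2}
```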


Graph Analytics - Network Properties

● Clustering coefficients - Measure how strongly a node's neighbors are connected to each other, which indicates how far the graph departs from random connections

● Centrality - Identify the most important nodes (e.g. PageRank)

● Community detection - Identify groups of nodes that are more densely connected among themselves than with the other nodes in the graph
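
Two of these measures are short enough to sketch in plain Python: the local clustering coefficient and a basic PageRank computed by repeated averaging. The graph, the damping factor of 0.85 and the iteration count are assumptions for illustration; community detection is left out because it needs considerably more code.

```python
from itertools import combinations

# Hypothetical undirected graph: node -> set of neighbors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}


def clustering(node):
    """Fraction of pairs of neighbors that are themselves connected."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: fewer than two neighbors means no pairs
    links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
    return links / (k * (k - 1) / 2)


def pagerank(damping=0.85, iterations=50):
    """Power iteration; each undirected edge counts as a link both ways."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nbr] / len(graph[nbr]) for nbr in graph[node])
            for node in graph
        }
    return rank


coefficients = {node: clustering(node) for node in graph}
ranks = pagerank()

print(coefficients["a"], coefficients["d"])  # 1.0 0.0
print(max(ranks, key=ranks.get))             # c
```

Node `a` scores 1.0 because its two neighbors are linked to each other, while `c` ranks highest in PageRank because it has the most connections in this small graph.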


Questions?


Contact

The [email protected]

(877) 664.6373
