(DAT203) Building Graph Databases on AWS
![Page 1: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/1.jpg)
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Todd Hildebrant and Matthew Sowders
AWS
October 2015
DAT203
Graph Databases on AWS
![Page 2: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/2.jpg)
What to Expect from the Session
• Who are we?
• General overview of graph database technology
• AWS architecture examples
• Amazon Fulfillment technology’s “Inventory Notification
Graph”
• Amazon DynamoDB Storage Backend for Titan
![Page 3: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/3.jpg)
Graph databases on AWS
![Page 4: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/4.jpg)
What is a graph? What is a graph database?
• A graph is a data structure consisting of vertices
(nodes), directed edges (relationships), and properties.
A tree is a special case of a graph.
• A graph database uses a property graph as the data
model and includes a query language.
• Other possible data models are hyper-graphs, triple-
stores, RDF.
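A minimal sketch of the property graph model described above, assuming a simple in-memory representation. The class names, the `out` helper, and the `since` property are illustrative only and do not come from any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: int  # id of the vertex the edge leaves
    in_v: int   # id of the vertex the edge points to
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Vertices and directed edges, both carrying key/value properties."""

    def __init__(self):
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vid, **props):
        self.vertices[vid] = Vertex(vid, props)
        return self.vertices[vid]

    def add_edge(self, out_v, label, in_v, **props):
        edge = Edge(label, out_v, in_v, props)
        self.edges.append(edge)
        return edge

    def out(self, vid, label):
        """Follow outgoing edges with the given label from vertex vid."""
        return [self.vertices[e.in_v] for e in self.edges
                if e.out_v == vid and e.label == label]


g = PropertyGraph()
g.add_vertex(1, name="Justin")
g.add_vertex(2, name="Anna")
g.add_edge(1, "friend", 2, since=2015)  # hypothetical edge property
friend_names = [v.properties["name"] for v in g.out(1, "friend")]
```

Because edges are directed, following `friend` from vertex 1 reaches Anna, while the same step from vertex 2 returns nothing until a reverse edge is added.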
![Page 5: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/5.jpg)
Graph data modeling
• NoSQL data models – Document, Key-Value, Columnar,
Graph, Mixed
• CAP and ACID
• Start with the use case, then develop the data model:
• As a Student, I want to know other Students in my Class who
know about a Subject
• Student KNOWS Subject, Student BELONGS_TO Class
Diagram: (Subject) <-KNOWS- (Student) -BELONGS_TO-> (Class)
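The use case above can be sketched as a traversal over labeled edges. This is an illustration only: the student, class, and subject names are made up, and the edge list stands in for a real graph store.

```python
# Edges as (source, label, target) triples; all names are hypothetical.
edges = [
    ("alice", "BELONGS_TO", "class-101"),
    ("bob", "BELONGS_TO", "class-101"),
    ("carol", "BELONGS_TO", "class-101"),
    ("dave", "BELONGS_TO", "class-202"),
    ("bob", "KNOWS", "graphs"),
    ("dave", "KNOWS", "graphs"),
    ("carol", "KNOWS", "sql"),
]


def out(vertex, label):
    """Targets of outgoing edges with this label."""
    return {t for s, l, t in edges if s == vertex and l == label}


def in_(vertex, label):
    """Sources of incoming edges with this label."""
    return {s for s, l, t in edges if t == vertex and l == label}


def classmates_who_know(student, subject):
    # Student -BELONGS_TO-> Class <-BELONGS_TO- other Students
    classmates = set()
    for cls in out(student, "BELONGS_TO"):
        classmates |= in_(cls, "BELONGS_TO")
    classmates.discard(student)
    # Keep only classmates with a KNOWS edge to the subject.
    return sorted(classmates & in_(subject, "KNOWS"))


result = classmates_who_know("alice", "graphs")
```

The query starts from one known vertex and follows edges outward, which is the access pattern the data model is designed around.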
![Page 6: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/6.jpg)
Graph vs. relational database
Graph
• Need to traverse a graph
without JOINs
• Queries have a starting
location MATCH ON x
• Normalized attribute to
enable filtering
• Dynamic schema
Relational
• Columnar analytics
• Tables denormalized for
performance
• Cluster and fault
management
• Recursive query support in
the query optimizer
![Page 7: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/7.jpg)
Titan: distributed graph database
• Distributed graph
• Storage layer has plug-in architecture
• Native TinkerPop implementation
• Full text search with Lucene, SOLR, Elasticsearch
• HA using multi-master replication (Cassandra cluster)
• Scalability using DynamoDB
![Page 8: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/8.jpg)
Neo4j
• Shared-nothing architecture, single master (writes),
multiple replicas (reads), embeddable using JVM
• HA when distributed, uses Paxos for master election
• Attempts to load DB into RAM, larger is better. Efficient
spilling to disk.
• Primary query language is Cypher, supports Gremlin
![Page 9: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/9.jpg)
AWS deployment for Neo4j
Availability Zone #1
Write ELB
Availability Zone #1
Read ELB
ELB health checks
HTTP GET
/db/manage/server/ha/master
/db/manage/server/ha/slave
/db/manage/server/ha/available
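A rough sketch of how role-based health checks could split one cluster into a write pool and a read pool. It assumes each role endpoint answers HTTP 200 only on an instance that currently holds that role, and that the load balancer keeps only instances answering 200 in service. The host names and the `simulated_status` function are hypothetical stand-ins for real HTTP calls.

```python
MASTER_PATH = "/db/manage/server/ha/master"
SLAVE_PATH = "/db/manage/server/ha/slave"


def simulated_status(role, path):
    """Assumed behavior: 200 when the instance holds the role named
    by the last path segment, 404 otherwise."""
    return 200 if path.endswith("/" + role) else 404


def in_service(cluster, health_check_path):
    """Hosts a load balancer would route to for this health check."""
    return sorted(host for host, role in cluster.items()
                  if simulated_status(role, health_check_path) == 200)


cluster = {"node-a": "master", "node-b": "slave", "node-c": "slave"}
write_pool = in_service(cluster, MASTER_PATH)  # behind the write ELB
read_pool = in_service(cluster, SLAVE_PATH)    # behind the read ELB
```

With this arrangement a master re-election needs no reconfiguration: the next round of health checks moves the new master into the write pool.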
![Page 10: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/10.jpg)
Analytics on graphs
• OLAP not OLTP
• Leverages the Hadoop / MapReduce framework
• GraphX runs graph analytics in memory on Spark; functional,
“declarative” programming model
• Giraph runs graph processing on MapReduce / HDFS; procedural,
vertex-centric programming model
• Aggregation type queries over the entire graph
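To illustrate the vertex-centric model mentioned above, here is a small single-process sketch of connected components computed in supersteps: each vertex keeps the smallest id it has heard of and messages its neighbors only when that value changes. It is a toy under stated assumptions, not Giraph or GraphX code, and the adjacency-dict input format is assumed.

```python
def connected_components(adjacency):
    """Label every vertex with the smallest vertex id in its component.

    adjacency maps each vertex to its neighbors (undirected graph,
    so every edge must appear in both directions).
    """
    label = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in adjacency}
    for v, neighbors in adjacency.items():
        for n in neighbors:
            inbox[n].append(label[v])

    # Later supersteps: a vertex reads only messages sent in the
    # previous superstep, and sends again only if its label improved.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for n in adjacency[v]:
                    outbox[n].append(label[v])
        inbox = outbox
    return label


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = connected_components(graph)
```

The computation touches every vertex, which is why this style of query belongs on the OLAP side rather than in a transactional traversal.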
![Page 11: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/11.jpg)
TinkerPop
• Apache Incubator graph framework supporting both
OLAP and OLTP.
• Gremlin, a query language for graph traversals.
Supports analysis, modification, and queries.
• Gremlin Structured API, a generic connector framework
or API. Interface to a backend graph engine.
![Page 12: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/12.jpg)
Graph DB use cases
• Social
• Recommendation
• Classic network problems
• Deep hierarchies
• Sensor analysis with geo-spatial constraints
• Fraud detection
• Identity and Access Management
![Page 13: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/13.jpg)
Recommendation engine example
[Diagram labels: Neo4j cluster; EMR; writes and reads; buy / like item; “People who bought this item also bought”; custom: “Something you recently looked at has changed”]
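The “People who bought this item also bought” query is a two-hop traversal: from the item to its buyers, then from those buyers to their other purchases. A minimal sketch with made-up customers and items, using a plain list in place of the graph database:

```python
from collections import Counter

# (customer, item) purchase edges; all data is hypothetical.
purchases = [
    ("ann", "book"), ("ann", "lamp"),
    ("ben", "book"), ("ben", "lamp"), ("ben", "desk"),
    ("cat", "book"),
    ("dan", "lamp"),
]


def also_bought(item, purchases, limit=3):
    # Hop 1: item -> customers who bought it.
    buyers = {c for c, i in purchases if i == item}
    # Hop 2: those customers -> everything else they bought.
    counts = Counter(i for c, i in purchases
                     if c in buyers and i != item)
    return counts.most_common(limit)


recommendations = also_bought("book", purchases)
```

In a graph database both hops follow stored edges from a known starting vertex, so the cost depends on the neighborhood of the item rather than on the total number of purchases.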
![Page 14: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/14.jpg)
Inbound fulfillment
![Page 15: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/15.jpg)
Inbound fulfillment data problems
![Page 16: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/16.jpg)
Approaches
Manual Research
• All tools emit events
• Humans trace the events
• Difficult to follow as search space increases
• Developed queries, but took too long to run
Unique Identifiers
• Every item gets a unique identifier
• Easy to get all related events
• Expensive
• Impractical for some items
![Page 17: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/17.jpg)
Inventory notification graph: data model
![Page 18: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/18.jpg)
Why not use a relational or NoSQL database?
• Relational Database
• Knew data volume would be huge and keep growing
• Did not want to vertically scale
• JOINs across tables would be expensive
• Use case required high availability
• NoSQL Store
• Would be the same solution without all the functionality built
into the TinkerPop Graph Framework
![Page 19: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/19.jpg)
Why a graph?
• No way to index just the events we need
• Need to perform search from receive to stow and vice
versa; i.e., requires many hops to find the data
• Need to process messages out of order
• Graphs provide a simple mental model
![Page 20: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/20.jpg)
Why Titan?
TinkerPop stack:
• Rexster (graph server)
• Blueprints (generic graph API)
• Furnace (graph algorithms)
• Frames (object-graph mapper)
• Gremlin (traversal language)
• Pipes (dataflows)
Titan
Backend: DynamoDB Local, DynamoDB, Cassandra, HBase, BerkeleyDB
![Page 21: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/21.jpg)
Cassandra
• Highly available
• Existing Titan implementation
• Ec2Snitch
• Replication
• RandomPartitioner
![Page 22: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/22.jpg)
Cassandra: Titan lessons learned
• No one on our team had experience managing or
configuring a Cassandra cluster
• Needed to manage a cluster
• Team manually replaces hosts as EC2 swaps them out
• Does not handle time series data well
• We ran two producers against two keyspaces so we
could efficiently drop old data
![Page 23: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/23.jpg)
DynamoDB: Titan
• Massively scalable
• No more tuning and host management
• Team was already familiar with DynamoDB
• Risky because there was no existing Titan
implementation
![Page 24: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/24.jpg)
Inventory notification graph – architecture
![Page 25: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/25.jpg)
DynamoDB: single-item data model
| Hash Key (hk) | Attribute | Attribute | Attribute | Attribute | Attribute |
|---|---|---|---|---|---|
| Vertex id 1 | Property – Name: Justin | Edge (out) – Friend: Anna | Edge (out) – Friend: Kris | Edge (out) – Likes: Movies | Hidden Property – Exists |
| Vertex id 2 | Property – Name: Anna | Edge (out) – Friend: Justin | Edge (out) – Likes: Books | Hidden Property – Exists | |
| Vertex id 3 | Property – Name: Kris | Edge (out) – Friend: Justin | Edge (out) – Likes: Movies | Hidden Property – Exists | |
| Vertex id 4 | Property – Name: Movies | Edge (out) – Friend: Justin | Edge (out) – Likes: Kris | Hidden Property – Exists | |
| Vertex id 5 | Property – Name: Books | Edge (out) – Friend: Anna | Hidden Property – Exists | | |
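The single-item layout can be sketched as one dictionary per vertex, with every property and edge stored as its own attribute. The attribute naming scheme below (`property:…`, `edge:…`, `hidden:exists`) is an assumption made for readability; the actual storage backend's encoding is not shown in the slides.

```python
def to_single_item(vertex_id, properties, out_edges):
    """Build one item per vertex, keyed only by the hash key."""
    item = {"hk": vertex_id}
    for name, value in properties.items():
        item[f"property:{name}"] = value
    for label, target in out_edges:
        item[f"edge:{label}:{target}"] = True
    item["hidden:exists"] = True  # marker that the vertex exists
    return item


justin = to_single_item(
    1,
    {"name": "Justin"},
    [("friend", "Anna"), ("friend", "Kris"), ("likes", "Movies")],
)
```

Reading a vertex is a single lookup by hash key, but the item grows with every edge added to that vertex.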
![Page 26: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/26.jpg)
DynamoDB: multiple-item data model
| Hash Key (hk) | Range Key (rk) | Value (v) |
|---|---|---|
| Vertex id 1 | Range key | |
| Vertex id 1 | Property id | Property – Name Justin |
| Vertex id 1 | Edge id | Edge (out) – Friend Anna |
| Vertex id 1 | Edge id | Edge (out) – Friend Kris |
| Vertex id 2 | Range key | |
| Vertex id 2 | Property id | Property – Name Anna |
| Vertex id 2 | Edge id | Edge (out) – Friend Justin |
| Vertex id 2 | Edge id | Edge (out) – Friend Brooks |
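In the multiple-item layout each property or edge becomes its own item, sharing the vertex id as hash key and distinguished by range key. A sketch under the assumption of a readable range-key scheme (`property:…`, `edge:…`); the `query` helper is a local stand-in for a hash-key query with a range-key prefix condition, not a DynamoDB client call.

```python
def to_multiple_items(vertex_id, properties, out_edges):
    """Build one item per property or edge of a vertex."""
    items = []
    for name, value in properties.items():
        items.append({"hk": vertex_id, "rk": f"property:{name}", "v": value})
    for label, target in out_edges:
        items.append({"hk": vertex_id, "rk": f"edge:{label}:{target}", "v": target})
    return items


def query(items, hk, rk_prefix):
    """Same hash key, range key beginning with the given prefix."""
    return [i for i in items
            if i["hk"] == hk and i["rk"].startswith(rk_prefix)]


table = to_multiple_items(
    1, {"name": "Justin"}, [("friend", "Anna"), ("friend", "Kris")]
)
friends = [i["v"] for i in query(table, 1, "edge:friend")]
```

Compared with the single-item layout, a vertex with many edges spreads across many small items, and a traversal can fetch just the edges it needs.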
![Page 27: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/27.jpg)
DynamoDB: how does it scale?
• Close to 100 billion vertices
• Terabytes of data
• Without corresponding increase in latency
![Page 28: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/28.jpg)
DynamoDB: Titan lessons learned
• Use Titan explicit partitioning on large graph
• Partition across multiple graphs for time series data
• Able to achieve stable performance at scale
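One way to picture partitioning time series data across multiple graphs is to route each event to a graph named after its time bucket, then retire whole graphs once they age out. The monthly bucket, the `events_YYYY_MM` naming scheme, and the retention window are all assumptions made for illustration.

```python
from datetime import date


def graph_name(day, prefix="events"):
    """Hypothetical naming scheme: one graph per calendar month."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def graphs_to_drop(existing, today, keep_months=2, prefix="events"):
    """Names of graphs older than the retention window."""
    keep = set()
    year, month = today.year, today.month
    for _ in range(keep_months):
        keep.add(graph_name(date(year, month, 1), prefix))
        month -= 1
        if month == 0:  # wrap from January back to December
            year, month = year - 1, 12
    return sorted(name for name in existing if name not in keep)


current = graph_name(date(2015, 10, 8))
expired = graphs_to_drop(
    ["events_2015_08", "events_2015_09", "events_2015_10"],
    today=date(2015, 10, 8),
)
```

Old data then leaves the system by removing an entire graph rather than by deleting individual vertices and edges.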
![Page 29: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/29.jpg)
How to get started
• GitHub Repository
• DynamoDB Local
• CloudFormation Template
![Page 30: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/30.jpg)
Resources
• Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem
• Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Eric Redmond and Jim R. Wilson
• NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence by Pramod J. Sadalage and Martin Fowler
• Titan Graph Database Integration with DynamoDB: World-class Performance, Availability, and Scale for New Workloads by Werner Vogels
• Store and Process Graph Data using the DynamoDB Storage Backend for Titan by Jeff Barr
• Amazon DynamoDB Storage Backend for Titan: Distributed Graph Database by Matthew Sowders and Alexander Patrikalakis
• Amazon DynamoDB Storage Backend for Titan FAQ
• Amazon DynamoDB Storage Backend for Titan Documentation
![Page 31: (DAT203) Building Graph Databases on AWS](https://reader034.vdocuments.us/reader034/viewer/2022042907/587199e11a28ab044e8b576b/html5/thumbnails/31.jpg)
Thank you!