choosing the right database
TRANSCRIPT
C H O O S I N G T H E R I G H T D ATA B A S E
D a v i d S i m o n s @ S w a m W i t h Tu r t l e s
C H O O S I N G T H E R I G H T D ATA B A S E
D a v i d S i m o n s @ S w a m W i t h Tu r t l e s
W H O A M I ?
• Tech Lead/Consultant at Softwire
• Consultancy/Implementation
• Design & UX
• Training
W H O A M I ?
• David Simons
• @SwamWithTurtles
• swamwithturtles.com
• Background in Statistics/Simulation
• Enjoys Databases and JavaScript
M Y E X P E R I E N C E W I T H A P I S
• Focused on web development
• Utilise data from our own, or third party APIs to affect rendered data
W H Y A R E W E H E R E ?B U T F I R S T … .
W E T H I N K A P I S A R E I M P O R TA N T
W E ’ R E H E R E B E C A U S E
W E W A N T T O K N O W H O W T O M A K E G O O D A P I S
W E ’ R E H E R E B E C A U S E
W E W A N T T O B U C K T R E N D S ( S A F E LY )
W E ’ R E H E R E B E C A U S E
A P I A R C H I T E C T U R E
D ATA B A S E C O D E Data
A P I A R C H I T E C T U R E
D ATA B A S E C O D E Data
[ O U R W AY ] W O R K S F O R U S !
B U T …
N O S Q L S O L U T I O N S C A N ’ T [ V E R B ]
B U T …
A ( P O L I T I C A L ) A S I D E
T H E AV R E F E R E N D U M
“ N O T O AV ”
W E H AV E T O A S K O U R S E LV E S T H E R I G H T Q U E S T I O N S
W H Y A M I T E L L I N G Y O U T H I S ?
C A N A N O S Q L A P P R O A C H D O E V E R Y T H I N G S Q L C A N ?
S O W E S H O U L D N ’ T A S K
W H AT W O U L D W E W A N T O U R I D E A L D ATA B A S E T O L O O K L I K E ?
I N S T E A D W E S H O U L D A S K
I W A N T T O D I S C U S S
• What sorts of questions should be asking about your data when architecting a system?
• What do the results of those questions mean for your database?
H O W B I G D O E S I T N E E D T O B E ?
A S K Y O U R S E L F
W I T H E N O U G H D ATA , Y O U H AV E T O D I S T R I B U T E
T H E H A R D T R U T H
W H AT H A P P E N S W H E N O N E I N S TA N C E G O E S D O W N ?
B U T …
– D Y N A M O : A M A Z O N ’ S H I G H LY AVA I L A B L E K E Y- VA L U E S T O R E
“Reliability at massive scale is one of the biggest challenges we face at Amazon.com. Even the
slightest outage has significant financial consequences and impacts customer trust.”
– D Y N A M O : A M A Z O N ’ S H I G H LY AVA I L A B L E K E Y- VA L U E S T O R E
“Dynamo targets applications that operate with weaker consistency if this results in high
availability.”
CONS I STENCY
A
B C
CONS I STENCY
A
B C
D Y N A M O I M P L E M E N TAT I O N S
N O T G U A R A N T E E D C O N S I S T E N C Y
T H E C O S T ?
“open source software project that enables distributed processing of large data sets across clusters of
commodity servers”
N O PA R T I T I O N T O L E R A N C E
T H E C O S T ?
C A P T H E O R E M
• Choose Two:
• Consistency
• Availability
• Partition Tolerance
W H AT I F W E D O N ’ T N E E D T O D I S T R I B U T E ?
B U T …
W H AT S H A P E I S Y O U R D ATA ?
A S K Y O U R S E L F
R D B M S ( R E L AT I O N A L D ATA B A S E M A N A G E M E N T S Y S T E M )
E V E R Y R O W I S A “ T H I N G ”
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
S E T- B A S E D O P E R AT I O N
R E A D D A TA O U T W I T H
“ W H E R E ” ( I N T E R S E C T I O N )
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
U N I O N S
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
5 Nemo
6 Moby Dick
7 Wanda
J O I N S
Name SpeciesSpecies Coolness
Rating
1 Puss 0
2 Dinah 0
3 Einstein 10
4 Jess 0
C A R T E S I A N P R O D U C T S
0 10
0 10
0 10
C A R T E S I A N P R O D U C T S
0 10
0 10
0 10
W H AT S Q L D O E S W E L L
• Modelling objects:
• With a fixed structure and shape
• With a limited number of relations
• With no opinion or opinion of any deeper underlying domain
R D B M S ( R E L AT I O N A L D ATA B A S E M A N A G E M E N T S Y S T E M )
T H E R E A R E P R O B L E M S T H I S I S B A D F O R
B U T …
K E V I N B A C O NS I X D E G R E E S O F …
T H E R E I S N O O P E N E L E C T I O N D ATA
T H E P R O B L E M
E L E C T I O N D ATA
E L E C T I O N D ATA
E L E C T I O N D ATA
E = (e.g.) member of, held in,
stood in…
V = elections, constituencies,
years, politicians and parties
W O R L D ’ S L E A D I N G G R A P H D B :
"embedded, disk-based, fully transactional Java persistence engine that stores data structured in
graphs rather than in tables"
D ATA S T O R A G E
D ATA S T O R A G E
D ATA S T O R A G E
• Nodes and edges are all:
• Stored as first-class objects on the file system
• “typed”
• Key-value stores
D ATA I N T H E R E L AT I O N S
• “Joins” are first class objects in the database that can be queried at no additional cost
• Certain queries become trivial (e.g. Joins)
P R O T O T Y P I N G
• Easy to see and work with data
• Schemaless
• Active community with a lot of libraries
N E O 4 J U S E R S
T I M E S E R I E S D ATA B A S E
Timestamp Value
2014-06-10T12:00:00+0100 17
2014-06-10T12:15:00+0100 17
2014-06-10T12:30:00+0100 20
2014-06-10T12:45:00+0100 22
2014-06-10T13:00:00+0100 24
2014-06-10T13:15:00+0100 28
2014-06-10T13:30:00+0100 32
SCHEMALESS
S O C I A L M E D I A S I T E
S O C I A L M E D I A S I T E
E V E R Y R O W I S A “ T H I N G ”
N A M E = P U S S C O O L N E S S = 0
!
N A M E = J E S S C O O L N E S S = 0
!
N A M E = D I N A H C O O L N E S S = 0
!
N A M E = E I N S T E I N C O O L N E S S = 1 0
!
DOCUMENT
Y O U C A N N O T D E N O R M A L I S E D ATA *
WA R N I N G
H O W M U C H A R E Y O U W I L L I N G T O S P E N D ?
A S K Y O U R S E L F
T H I S M AY S E E M L I K E A T R I V I A L P O I N T…
C O S T S …
• Oracle: $50,000+
• SQL Server: $10,000+
O P E N - S O U R C E F R A M E W O R K S
• MySQL
• PostgreSQL
• Riak
• Voldemort
• MariaDB
• Cassandra
• MongoDB
H O W A R E Y O U R E T R I E V I N G T H E D ATA ?
A S K Y O U R S E L F
T H I N G S H AV E T O U S E O U R D ATA …
R E M E M B E R T H A T
A P I A R C H I T E C T U R E
D ATA B A S E C O D E Data
A P I A R C H I T E C T U R E
D ATA B A S E C O D E Data E N D U S E R
A P I A R C H I T E C T U R E
E N D U S E R What is the average age of …?
A P I A R C H I T E C T U R E
E N D U S E REr….
I think it was something like “Campbell”?
O U R C H O I C E I S I N F O R M E D B Y O U R P L A N S F O R T H E A P I
R E M E M B E R T H A T
G E O S PAT I A L I N D E X E S
L I N K E D M E D I A F R A M E W O R K
A PA C H E M A R M O T TA
O U T O F T H E B O X …
T R I P L E VA L U E S T O R E
• Video A contains Alice McSkaterton
• Alice McSkaterton is married to Brock Windsurferling
• Article B contains Brock Windsurferling
D O C U M E N T S T O R E
S E A R C H A P I S
D O C U M E N T S T O R E
E V E R Y R O W I S A “ T H I N G ”
N A M E = P U S S C O O L N E S S = 0
!
N A M E = J E S S C O O L N E S S = 0
!
N A M E = D I N A H C O O L N E S S = 0
!
N A M E = E I N S T E I N C O O L N E S S = 1 0
!
DOCUMENT
A PA C H E L U C E N E
“Apache Lucene is a high-performance, full-featured text search engine library … It is a
technology suitable for nearly any application that requires full-text search”
F O C U S E D A R O U N D T E X T S E A R C H I N G Q U E R I E S
{ "query": { "match": {"hobbies": "skateboard"} } }
{ "query": { {"fuzzy": {"hobbies": “skateboarig"}} } }
{ "query": { {"match": {"hobbies": {"query": "writing reddit comments", "type": "phrase"}}} } }
T O S U M U P…
I S Y O U R D ATA B I G ?
• You’ll need to distribute it - which means you have to choose one of CAP to sacrifice
• Hadoop will provide consistent results, but isn’t built to fail
• Cassandra and Riak will be highly available but slow to propagate new data
I F I T ’ S N O T B I G …
• You can store the data in a high-fidelity, way:
• Denormalised - SQL
• As a graph with Neo4j
• Choosing an unsuitable method is costly!
H O W D O Y O U W A N T I T B A C K ?
• Some databases have custom indexing which we are optimised for some queries
• Geospatial Indexes
• In-text Indexing
• Time Series
S O …
• Your requirements will be unique, so don’t lump with the industry standards
• There are a range of modern DB solutions that are no longer bleeding-edge
• Consider them on their own merits!
A N Y Q U E S T I O N S ?D a v i d S i m o n s
@ S w a m W i t h Tu r t l e s