progressive nosql: cassandra
Tom Wilkie's talk at the Progressive NOSQL conference in London on 11/05/12.
A tunably consistent, highly-available, Distributed Database
Tom Wilkie @tom_wilkie
Founder & VP Engineering, Acunu
• Overview
• Distribution
• Storage
• Datamodel
• Usecases
• A distributed database for Big Data
• Scale out on commodity servers
• Best-of-breed performance
• Multi-master architecture, no SPOF
• Powerful multi data centre support
• BigTable, 2006
• Dynamo, 2007
• Open sourced, 2008
• Incubator, 2009
• TLP, 2010
• v1.0, 2011
BigTable: ...
• Simple but powerful datamodel
• Write-optimised storage system
• Consistent, available but not partition tolerant
• Master-slave distribution system, SPOF
http://goo.gl/7T1Ej
Dynamo: ...
• Sophisticated distribution system with tradable consistency and availability
• Over-simple datamodel
http://goo.gl/Q80b4
Distribution: Consistent Hashing
r1, c1 → v1
r2, c2 → v2
r3, c3 → v3
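A minimal sketch of how row keys such as r1, r2 and r3 could be placed on a consistent-hash ring. The node names, the use of MD5 as the hash, and the one-token-per-node layout are assumptions made for illustration; they are not taken from the talk.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string onto the ring. MD5 is an assumed stand-in for the partitioner."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical consistent-hash ring in which each node owns one token."""

    def __init__(self, nodes):
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, row_key: str) -> str:
        # The owner is the first node clockwise from the key's token,
        # wrapping around to the first node at the end of the ring.
        i = bisect.bisect_left(self._tokens, token(row_key))
        return self._entries[i % len(self._entries)][1]


nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = Ring(nodes)
placement = {key: ring.owner(key) for key in ("r1", "r2", "r3")}
```

Any client that knows the node list computes the same placement, so no central lookup table is needed.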
Distribution: Scaling
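The scaling property of consistent hashing can be sketched as follows: when a node joins, the only keys that change owner are the ones the new node takes over. The node names, the key set and the MD5-based token function are illustrative assumptions.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Assumed hash function placing nodes and keys on the same ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def owner(nodes, row_key):
    """First node clockwise from the key's token, wrapping around the ring."""
    entries = sorted((token(n), n) for n in nodes)
    tokens = [t for t, _ in entries]
    i = bisect.bisect_left(tokens, token(row_key))
    return entries[i % len(entries)][1]


keys = [f"row-{i}" for i in range(1000)]
old_nodes = ["node-a", "node-b", "node-c"]
new_nodes = old_nodes + ["node-d"]

before = {k: owner(old_nodes, k) for k in keys}
after = {k: owner(new_nodes, k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Every key that moved now lives on the new node; nothing is
# shuffled between the nodes that were already in the ring.
moved_to_new = all(after[k] == "node-d" for k in moved)
```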
Distribution: Replication
r1, c1 → v1
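One simple way to place replicas on the ring is to store each row on its owner and on the next nodes clockwise. The sketch below assumes that policy, a replication factor of 3 and hypothetical node names; real deployments may use rack-aware or data-centre-aware placement instead.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Assumed hash function placing nodes and keys on the same ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas(nodes, row_key, n=3):
    """Return the n nodes holding row_key: the owner plus its clockwise successors."""
    entries = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in entries]
    start = bisect.bisect_left(tokens, token(row_key))
    count = min(n, len(entries))  # cannot have more replicas than nodes
    return [entries[(start + i) % len(entries)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
replica_set = replicas(nodes, "r1", n=3)
```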
Distribution: Consistency
Tuneable, per-operation consistency
Timestamped values, R + W > N
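The quorum condition can be checked by brute force. With N replicas, a read that contacts R of them is guaranteed to see the latest write acknowledged by W of them only if every possible read set overlaps every possible write set, which holds exactly when R + W > N. The function name below is hypothetical.

```python
from itertools import combinations


def always_overlap(n: int, r: int, w: int) -> bool:
    """True if every set of r read replicas shares a node with every set of w write replicas."""
    nodes = range(n)
    return all(
        set(reads) & set(writes)  # an empty intersection is falsy
        for reads in combinations(nodes, r)
        for writes in combinations(nodes, w)
    )


N = 3
results = {
    (r, w): always_overlap(N, r, w)
    for r in range(1, N + 1)
    for w in range(1, N + 1)
}
# Overlap is guaranteed exactly when R + W > N.
consistent = all(ok == (r + w > N) for (r, w), ok in results.items())
```

For N = 3, quorum reads and writes (R = 2, W = 2) always overlap, while R = 1 and W = 1 trades that guarantee for lower latency and higher availability.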
Distribution: Read Repair
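A minimal sketch of read repair, assuming each stored value carries a timestamp and the newest timestamp wins. The replicas are modelled as plain dictionaries and all names are hypothetical; a real coordinator would contact only as many replicas as the consistency level requires and repair the rest in the background.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int


def read_with_repair(replicas, key):
    """Read key from every replica, return the newest value, and repair stale replicas."""
    responses = {name: store.get(key) for name, store in replicas.items()}
    present = [v for v in responses.values() if v is not None]
    if not present:
        return None
    newest = max(present, key=lambda v: v.timestamp)
    for name, seen in responses.items():
        if seen is None or seen.timestamp < newest.timestamp:
            replicas[name][key] = newest  # write the newest version back
    return newest.value


replicas = {
    "node-a": {"r1": Versioned("v1", timestamp=1)},  # stale
    "node-b": {"r1": Versioned("v2", timestamp=2)},  # newest
    "node-c": {},                                    # missed the write
}
result = read_with_repair(replicas, "r1")
```

After the read, all three replicas hold the version with timestamp 2.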
Writing to Cassandra
[Figure sequence: a row (row key plus columns) is written to the Memtable (in the JVM) and to the Commit log (on disk); a full Memtable is replaced by a new Memtable and written out as an SSTable on disk; many SSTables on disk become a single SSTable.]
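The write path can be sketched as a toy store: append to a commit log, update an in-memory memtable, flush a full memtable to a sorted on-disk file, and merge several such files into one. The class, file names, JSON encoding and the tiny memtable limit are all assumptions for illustration and do not reflect Cassandra's actual formats.

```python
import json
import os
import tempfile


class Store:
    """Toy write path: commit log append, memtable insert, flush, compaction."""

    def __init__(self, directory, memtable_limit=2):
        self.directory = directory
        self.memtable_limit = memtable_limit  # rows held before flushing
        self.memtable = {}
        self.sstables = []  # file paths, oldest first
        self.log_path = os.path.join(directory, "commit.log")

    def write(self, row_key, column, value):
        # Sequential append to the commit log for durability.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([row_key, column, value]) + "\n")
        self.memtable.setdefault(row_key, {})[column] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(dict(sorted(self.memtable.items())), out)  # rows sorted by key
        self.sstables.append(path)
        self.memtable = {}  # start a new, empty memtable
        open(self.log_path, "w").close()  # flushed rows no longer need the log

    def compact(self):
        merged = {}
        for path in self.sstables:  # oldest first, so newer columns win
            with open(path, encoding="utf-8") as src:
                for row_key, columns in json.load(src).items():
                    merged.setdefault(row_key, {}).update(columns)
            os.remove(path)
        path = os.path.join(self.directory, "sstable-compacted.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(dict(sorted(merged.items())), out)
        self.sstables = [path]


with tempfile.TemporaryDirectory() as tmp:
    store = Store(tmp, memtable_limit=2)
    store.write("r1", "c1", "v1")
    store.write("r2", "c2", "v2")   # memtable full: first flush
    store.write("r3", "c3", "v3")
    store.write("r1", "c1", "v1b")  # memtable full: second flush
    flushed = len(store.sstables)
    store.compact()
    with open(store.sstables[0], encoding="utf-8") as f:
        compacted = json.load(f)
```

Writes never modify an existing file in place, which is what makes this layout write-optimised; the cost is that reads may have to consult several files until compaction merges them.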
Reading from Cassandra
[Figure: read path in six numbered steps. In the JVM: Memtable. Off-heap (no GC): Row cache. On disk: SSTable, Commit log. Also shown: Key cache, Bloom filter, SSTable index.]
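A rough sketch of a read that consults a row cache, skips SSTables whose Bloom filter rules the key out, and merges what it finds with the memtable. The order of the steps, the class names and the Bloom filter parameters are assumptions; the numbered order on the slide is not recoverable from the transcript, and the key cache and SSTable index are omitted here.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    """Immutable table of rows with a Bloom filter over its row keys."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.disk_reads = 0

    def get(self, key):
        self.disk_reads += 1  # stands in for an index lookup plus a disk seek
        return self.rows.get(key)


def read(key, row_cache, memtable, sstables):
    if key in row_cache:
        return row_cache[key]
    merged = {}
    for table in sstables:  # oldest first
        if table.bloom.might_contain(key):  # skip tables that cannot hold the key
            merged.update(table.get(key) or {})
    merged.update(memtable.get(key, {}))  # the memtable holds the newest columns
    if merged:
        row_cache[key] = merged
    return merged or None


old = SSTable({"r1": {"c1": "v1"}})
new = SSTable({"r2": {"c2": "v2"}})
cache = {}
memtable = {"r1": {"c9": "v9"}}
row = read("r1", cache, memtable, [old, new])
row_again = read("r1", cache, memtable, [old, new])  # served from the row cache
```

This toy cache is never invalidated, so it is only valid while no further writes arrive for the cached row.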
SQL         Cassandra
Database    Keyspace
Table       Column Family
[Figure: rows, each a row/key followed by its columns (col_1, col_2).]
[Figure: sparse table of row1 to row7 against col1 to col7; each row has values ("x") in only some of the columns.]
alice: { m2: { Sender: bob, Subject: ‘paper!’, ... }}
bob: { m1: { Sender: alice, Subject: ‘rock?’, ... }}
charlie: { m1: { Sender: alice, Subject: ‘rock?’, ... }, m2: { Sender: bob, Subject: ‘paper!’, ... }}
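The inbox layout above can be mimicked with nested dictionaries: row key, then column name, then value. This is only a sketch; the `deliver` helper and the choice to store just Sender and Subject are assumptions, and the fields elided with "..." on the slide are left out.

```python
from collections import defaultdict

# Column family modelled as: row key -> column name -> value.
inbox = defaultdict(dict)


def deliver(message_id, sender, subject, recipients):
    """Write one copy of the message into each recipient's row."""
    for recipient in recipients:
        inbox[recipient][message_id] = {"Sender": sender, "Subject": subject}


deliver("m1", "alice", "rock?", ["bob", "charlie"])
deliver("m2", "bob", "paper!", ["alice", "charlie"])

# Reading an inbox is a single-row lookup; rows need not share the same columns.
charlie_messages = sorted(inbox["charlie"])
```

Storing a copy per recipient makes each inbox read a lookup on one row key, at the cost of writing the message more than once.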
Perfect for high velocity data
• Location Services
• Web, SCM, Retail
• Fraud Detection
• Cloud Monitoring
• Oil/Gas Sensors
• Social Media
• Social Gaming
• Ad Marketplaces
• Smart Metering
Not Covered...
• Distribution: Hinted Handoff, Anti-entropy repair, Counter distribution
• Storage: Counter storage, different compaction strategies, partitioning etc
• Datamodel: de-normalisation, TTLs, secondary indexes, CQL, super-columns, schema optional
• Operations: backup, nodetool, performance tuning
• Integration: Hadoop, Client Libraries etc
• Distributed, scalable database
• Open source, widely used
• Tunably consistent
• Highly-available
• Partition tolerant
• Write-optimised
• Schema-optional
Data Platform
[Figure: platform stack with the components Data driven applications, Web UI, Control Center, Acunu Analytics, Apache Cassandra, Acunu Storage Engine, Configured and tuned OS, and Commodity Hardware.]
Control Center
“I've had the EC2 instance running for a little while and I have to say, I'm impressed. You guys have done well with this product.”
- Lloyd, JustDevelopIt
Control Center
“The new UI has been critical in helping us work out what is wrong in our code”
- Matt, TellyBug
Castle: Built for Big Data
• Storage engine optimized for large slow disks, many cores, Big Data workloads
• Enterprise density on commodity hardware
• Lightning disk rebuilds: 10x faster than RAID
[Figure: Castle architecture, spanning Userspace and the Acunu Kernel. Layers: userspace interface; kernelspace interface; doubling array mapping layer; modlist btree mapping layer; block mapping & caching layer; "Extent" layer (extent allocator, freespace manager). Other labels: doubling arrays, insert queues, Bloom filters, arrays management, merges, value arrays, btree, version tree, cache, flusher, extent block cache, prefetcher, streaming interface, shared memory interface (keys, values), shared buffers, async shared-memory ring, in-kernel workloads. Operations shown: key insert, key get, buffered value insert, buffered value get, range queries.]
• Open source (GPLv2, MIT for user libraries)
• http://bitbucket.org/acunu
• Loadable Kernel Module, targeting CentOS’s 2.6.18
• http://www.acunu.com/blogs/andy-twigg/why-
Castle
http://goo.gl/gzihe
Rebuild time
[Figure: bar chart of rebuild time in hours (axis 0 to 5) for RAID10 with 8 disks, RAID5 with 8 disks, RDA with 8 disks, and RDA with 15 disks.]
Analytics
• Simple, real-time, incremental analytics
• Push processing into ingest phase
[Figure: events (click stream, sensor data, etc.) flow into Acunu Analytics, which turns them into counter updates.]
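Pushing processing into the ingest phase can be sketched as updating pre-aggregated counters as each event arrives, so that queries read a counter instead of rescanning raw events. The event fields (`page`, `ts`), the hourly and daily buckets and the in-memory `Counter` are hypothetical stand-ins, not Acunu Analytics' actual interface.

```python
from collections import Counter
from datetime import datetime, timezone

counters = Counter()


def ingest(event):
    """Update the daily and hourly counters for the event's page as it arrives."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    for bucket in (ts.strftime("%Y-%m-%d"), ts.strftime("%Y-%m-%d %H:00")):
        counters[(event["page"], bucket)] += 1


for e in [
    {"page": "/home", "ts": 0},     # 1970-01-01 00:00:00 UTC
    {"page": "/home", "ts": 60},    # same hour
    {"page": "/home", "ts": 3600},  # next hour, same day
]:
    ingest(e)

hits_first_hour = counters[("/home", "1970-01-01 00:00")]
hits_day = counters[("/home", "1970-01-01")]
```

Keeping one counter per granularity is what allows roll ups (day) and drill downs (hour) to be answered without recomputation.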
Introduction
Live & historical aggregates...
Realtime trends...
Drill downs and roll ups
Solution / Con
• Scalability
• $$$
• Not realtime
• Inefficient recomputation
• Spartan query semantics => complex, DIY solutions