Fusion-io Confidential—Copyright © 2013 Fusion-io, Inc. All rights reserved.
Cassandra With No Moving Parts
Matt Kennedy
Cassandra Summit: June 12, 2013
"Switch your databases to flash storage now. Or you're doing it wrong." (Brian Bulkowski, Aerospike Founder and CTO)
June 18, 2013 2 #Cassandra13
http://highscalability.com/blog/2012/12/10/switch-your-databases-to-flash-storage-now-or-youre-doing-it.html
Why?
Flash IOPS Drives Server Adoption
▸ Capacity: 4 TB (disk) vs 3 TB (flash)
▸ IOPS: 150 (disk) vs 200,000 (flash)
▸ Cost per IOP: $$$$ (disk) vs ¢¢¢¢ (flash)
What is flash?
NAND Flash Memory
Flash is a persistent memory technology invented by Dr. Fujio Masuoka at Toshiba in 1980.
[Diagram: floating-gate NAND cell, showing bit line, source line, word line, control gate, floating gate, and N-P-N substrate]
Consumer Volume Drives Economics
Flash in Servers
Direct Cut Through Architecture
[Diagram: legacy approach: Host CPU (app/OS) → DRAM → PCIe → SAS → RAID controller → datapath controller with supercapacitors → NAND. Fusion direct approach: Host CPU (app/OS) → DRAM → PCIe → NAND]
The goal of every I/O operation is to move data between DRAM and flash.
How can we use it in Cassandra?
Cassandra I/O - Writes
http://www.datastax.com/docs/1.2/dml/about_writes
Cassandra I/O - Reads
http://www.datastax.com/docs/1.2/dml/about_reads
Memory
DRAM Dictates Cassandra Scaling
▸ Key design principle: Working set < DRAM
Cost of DRAM Modules
[Chart: DRAM module price by capacity, rising sharply with size: 4 GB ($), 8 GB ($$), 16 GB ($$$), 32 GB ($$$$$$); y-axis $0 to $1,600]
When do we scale out?
▸ A typical server…
CPU cores: 32 (with HT); Memory: 128 GB
…is your working set > 128 GB?
Is there a better way?
▸ With NoSQL databases, we tend to scale out for DRAM
Combined resources of three such servers: CPU cores: 96; Memory: 384 GB
That is more cores than needed to serve reads and writes.
Flash Offers A New Architectural Choice
[Chart: latency hierarchy: CPU cache and DRAM (nanoseconds, 10⁻⁹), server-based flash (microseconds, 10⁻⁶), disk drives (milliseconds, 10⁻³)]
Three Deployment Options
1. All flash
2. Data placement (CASSANDRA-2749)
3. Use logical data centers
Cassandra with All-Flash Storage
Step 1: Mount ioMemory at /var/lib/cassandra/data
Step 2:
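A minimal sketch of Step 1, assuming the ioMemory card shows up as /dev/fioa (the device name depends on the driver install) and Cassandra's default data directory:

```shell
# Create a filesystem on the ioMemory card and mount it where Cassandra
# keeps its data. /dev/fioa is an assumed device name; verify yours first.
mkfs.ext4 /dev/fioa                                # xfs is another common choice
mkdir -p /var/lib/cassandra/data
mount -o noatime /dev/fioa /var/lib/cassandra/data
chown -R cassandra:cassandra /var/lib/cassandra/data
```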
Data Placement
▸ https://issues.apache.org/jira/browse/CASSANDRA-2749 (thanks, Marcus!)
▸ Takes advantage of the filesystem hierarchy
▸ Use mount points to pin keyspaces or column families to flash: /var/lib/cassandra/data/{Keyspace}/{CF}
▸ Use flash for high-performance needs, disk for capacity needs
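One way to sketch the mount-point approach is a bind mount; /mnt/flash, the keyspace ks, and the column family hot_cf below are illustrative names:

```shell
# Pin one column family's directory to flash while the rest of the
# keyspace stays on disk (flash assumed mounted at /mnt/flash).
mkdir -p /mnt/flash/ks/hot_cf
mount --bind /mnt/flash/ks/hot_cf /var/lib/cassandra/data/ks/hot_cf
# Alternatively, mount a dedicated flash partition directly at that path.
```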
Data Centers for Storage Control
One Cassandra cluster, with logical data centers tiered by storage:
▸ DC1 (interactive requests): high performance
▸ DC2 (Hadoop MR jobs): medium performance
▸ DC3 (high-density replicas): low performance, high capacity per node
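Replica placement across those logical data centers is set per keyspace; a sketch using NetworkTopologyStrategy, where the keyspace name and replica counts are illustrative and the DC names must match the snitch configuration:

```shell
# Create a keyspace whose replicas span the three logical data centers.
cqlsh <<'CQL'
CREATE KEYSPACE myks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'DC1': 2, 'DC2': 1, 'DC3': 1};
CQL
```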
The Numbers
YCSB Testing Setup
▸ YCSB load generators (×4) driving a Cassandra cluster (×4 nodes)
▸ Per node: 10GB, 16 cores, 24 GB DRAM
▸ Workloads use uniform random key selection instead of Zipfian
▸ 150 million 1 KB records, RF=3: ~120 GB of SSTables per node
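The load phase of a run like this might be invoked as follows; a sketch assuming YCSB's cassandra-10 binding (current at the time) and illustrative host names and thread count:

```shell
# Load 150M 1KB records into the 4-node cluster with uniform key selection.
bin/ycsb load cassandra-10 -P workloads/workloada \
    -p hosts=cass1,cass2,cass3,cass4 \
    -p recordcount=150000000 \
    -p requestdistribution=uniform \
    -threads 200
```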
YCSB: Bulk Load (CL=ALL)
[Chart: YCSB inserts/sec over the bulk load (x-axis 10 to 2,830 s, y-axis 0 to 70,000)]
Avg latency: 0.9 ms; 95th percentile: 1 ms; 99th percentile: 4 ms
95/5 R/W Uniform distribution
[Chart: mixed ops/sec over time at 75, 200, and 300 client threads (x-axis 10 to 690 s, y-axis 0 to 80,000)]

# threads   Avg latency (R/W)   95th pctl (R/W)   99th pctl (R/W)
75          1.4 / 0.22 ms       2 / 0 ms          5 / 0 ms
200         3.1 / 0.19 ms       7 / 0 ms          13 / 0 ms
300         4.4 / 2.2 ms        11 / 0 ms         19 / 0 ms
50/50 R/W, uniform distribution, 10 hrs
[Chart: YCSB mixed ops/sec over the 10-hour run (x-axis 10 to ~35,300 s, y-axis 0 to 70,000)]
Update latency: avg 511 µs; 95th pctl: 1 ms; 99th pctl: 2 ms
Read latency: avg 7.0 ms; 95th pctl: 18 ms; 99th pctl: 42 ms
Write Amplification
Amplification Factor = Physical Bytes Written / Workload Bytes Written
Cassandra:
Workload                                        Write amp
Leveled compaction load (250 MB tier-0)         0.8–1.2×
24-hour mixed workloads                         1.2–2.1×
Size-tiered w/ major compactions (old skool)    3–15×

Compares favorably to HBase:
Workload type                                   Amplification factor
Bulk load                                       14.8
Normal operations (80/20 update/insert split)   4.2
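The amplification formula is simple enough to check by hand; a sketch with hypothetical byte counts, chosen only to illustrate how a 14.8× figure arises:

```shell
# Worked example of Amplification Factor = physical bytes / workload bytes.
# Both numbers are illustrative: physical bytes come from the drive's write
# counters, workload bytes from the client's accounting.
physical_gb=444   # hypothetical GB actually written to NAND
workload_gb=30    # hypothetical GB written by the workload
awk -v p="$physical_gb" -v w="$workload_gb" 'BEGIN { printf "%.1f\n", p / w }'
# prints 14.8
```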
Next Step in Flash Evolution
Flash as disk → Native flash APIs → Flash as memory
Rethinking Cassandra I/O
http://www.datastax.com/docs/1.2/dml/about_writes
[Write-path diagram, with flash highlighted]
Rethinking Cassandra I/O
[Write-path diagram, with the memtable replaced by a "Flashtable" on flash]
Accelerating Cassandra With Flash
Cassandra + ioMemory: a NAND flash accelerator
Real-World Cassandra on Fusion
fusionio.com | REDEFINE WHAT'S POSSIBLE
THANK YOU
Cassandra: ioDrive2 vs 10 disk RAID-0
12-hour mixed read/write workload
[Chart: ops/sec over the 12-hour run (x-axis 10 to ~42,600 s, y-axis 0 to 40,000); series: CL=1 reads, CL=Q reads, CL=Q writes (throttled)]
50/50 R/W Uniform distribution
[Chart: YCSB mixed ops/sec over time (x-axis 10 to 550 s, y-axis 0 to 120,000)]
Update latency: avg 311 µs; 95th pctl: 0 ms; 99th pctl: 1 ms
Read latency: avg 8.2 ms; 95th pctl: 20 ms; 99th pctl: 62 ms