A Deep Dive into Understanding Apache Cassandra
DESCRIPTION
Inside Cassandra – C* is an interesting piece of software for many reasons, but it is especially interesting in its use of elegant data structures and algorithms. This talk will focus on the data structures and algorithms that make C* such a scalable and performant database. We will walk along the write, read and delete paths, exploring the low-level details of how each of these operations works. We will also explore some of the background processes that maintain availability and performance. The goal of this talk is to gain a deeper understanding of C* by exploring the low-level details of its implementation.
TRANSCRIPT
Inside Cassandra
Michael Penick
Overview
• To disk and back again
• Cassandra Internals by Aaron Morton
• Goals
  – RDBMS comparison to C*
  – Make educated decisions
I’m configuration
Distributed Hashing
[Diagram: keys A through P spread across Node 0, Node 1, Node 2, and Node 3]
Location = Hash(Key) % # Nodes
Distributed Hashing
[Diagram: Node 4 joins Node 0 through Node 3, and most of the keys A through P land on a different node]
% Data Moved = 100 * N / (N + 1)
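The effect of the modulo placement rule can be checked with a small simulation. This is only a sketch: MD5 stands in for whatever hash a real system would use, and the key names and the helper names `node_for` and `fraction_moved` are made up for illustration.

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Location = Hash(Key) % # Nodes, with MD5 as a stand-in hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def fraction_moved(keys, old_nodes: int, new_nodes: int) -> float:
    """Share of keys whose node changes when the cluster size changes."""
    moved = sum(1 for k in keys if node_for(k, old_nodes) != node_for(k, new_nodes))
    return moved / len(keys)

keys = [f"key-{i}" for i in range(10_000)]
print(f"{100 * fraction_moved(keys, 4, 5):.1f}% of keys moved")
```

Going from 4 to 5 nodes, the formula predicts 100 * 4 / 5 = 80% of the data moving, and the simulated share should land close to that.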
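Consistent Hashing
[Diagram: token ring starting at 0 with Node 1, Node 2, Node 3, and Node 4]
Consistent Hashing
[Diagram: keys A through P divided among the four nodes on the ring: A E I M, B F J N, C G K O, D H L P]
Add Node 0
[Diagram: the same ring and keys after Node 0 joins]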
% Data Moved = 100 * 1 / N
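A minimal ring sketch shows why so little data moves. It assumes one token per node derived from an MD5 of the node name, which is an illustration choice and not how Cassandra assigns tokens; `Ring`, `token`, and the node names are hypothetical.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Map a node name or key onto the ring (MD5 is a stand-in hash)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

class Ring:
    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def add(self, node: str) -> None:
        bisect.insort(self._tokens, (token(node), node))

    def owner(self, key: str) -> str:
        # First node token at or after the key's token, wrapping past the end.
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]

keys = [f"key-{i}" for i in range(10_000)]
ring = Ring(["node1", "node2", "node3", "node4"])
before = {k: ring.owner(k) for k in keys}
ring.add("node0")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{100 * len(moved) / len(keys):.1f}% of keys moved, all to node0")
```

Only keys in the range taken over by the new node change owner. With a single token per node the size of that range is uneven, which is the problem virtual nodes are meant to smooth out.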
Virtual Nodes
Found: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
num_tokens
initial_token
Tunable Consistency
[Diagram: ring starting at token 0 with Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]
replication_factor = 3
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE
Tunable Consistency
[Diagram: ring starting at token 0 with Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]
replication_factor = 3
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
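The number of replica acknowledgements each level waits for can be written down directly. This is a sketch covering ONE and QUORUM from the slides plus ALL for comparison; the function name `required_acks` is hypothetical.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Replica responses needed before the coordinator answers the client."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        # A majority of the replicas.
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level: {level}")

for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, 3))
```

With replication_factor = 3, QUORUM waits for 2 of the 3 replicas, so a quorum write and a quorum read always overlap in at least one replica.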
Hinted Handoff
[Diagram: ring starting at token 0 with Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]
replication_factor = 3 and hinted_handoff_enabled = true
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY
Write locally: system.hints
Note: A hint does not count toward the consistency level (except ANY)
Tunable Consistency
[Diagram: ring starting at token 0 with Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM
[Diagram: second ring starting at token 0 with Node 1 through Node 6 and replicas R1 and R2]
Appends FWD_TO parameter to message
Read Repair
[Diagram: ring starting at token 0 with Node 1 through Node 6; the client's read involves replicas R1, R2, and R3]
SELECT * FROM table USING CONSISTENCY ONE
replication_factor = 3 and read_repair_chance > 0
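The reconciliation step behind read repair can be sketched as "newest timestamp wins, then push it to the stale replicas". The replica names match the slide; the timestamps, values, and the `resolve` helper are hypothetical, and real read repair works on digests and mutations, not whole values.

```python
from typing import Dict, List, Tuple

Versioned = Tuple[int, str]  # (write timestamp, value)

def resolve(responses: Dict[str, Versioned]) -> Tuple[Versioned, List[str]]:
    """Pick the newest version and list the replicas that returned something else."""
    newest = max(responses.values())  # highest timestamp wins
    stale = sorted(r for r, v in responses.items() if v != newest)
    return newest, stale

responses = {"R1": (20, "new"), "R2": (10, "old"), "R3": (20, "new")}
newest, stale = resolve(responses)
for replica in stale:
    responses[replica] = newest  # the repair write sent to the stale replica
print(newest, stale)
```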
Write
[Diagram: a write of K1 (C1:V1, C2:V2) is appended to the Commit Log on disk and applied to the Memtable in memory; a flush writes the memtable to disk as SSTable #1]
Flush when:
> commitlog_total_space_in_mb
or
> memtable_total_space_in_mb
Write
[Diagram: a later write of K1 (C3:V3) goes to the Commit Log and the Memtable; SSTable #1 still holds K1 C1:V1 C2:V2, and the next flush produces SSTable #2]
Note: All writes are sequential!
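The write path above can be modelled in a few lines. This is a toy sketch, not Cassandra's implementation: `Node`, `memtable_limit` (a cell count standing in for memtable_total_space_in_mb), and the in-memory lists are all assumptions made for illustration.

```python
class Node:
    def __init__(self, memtable_limit: int):
        self.commit_log = []   # append-only record of every mutation
        self.memtable = {}     # key -> {column: value}, held in memory
        self.sstables = []     # immutable, sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, column, value):
        self.commit_log.append((key, column, value))        # 1. sequential append
        self.memtable.setdefault(key, {})[column] = value   # 2. update in memory
        if sum(len(cols) for cols in self.memtable.values()) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. write the memtable out in key order; existing SSTables are never modified
        sstable = tuple((k, tuple(sorted(cols.items())))
                        for k, cols in sorted(self.memtable.items()))
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log.clear()  # flushed mutations no longer need replay

node = Node(memtable_limit=2)
node.write("K1", "C1", "V1")
node.write("K1", "C2", "V2")   # reaches the limit and flushes SSTable #1
node.write("K1", "C3", "V3")   # stays in the new memtable
print(node.sstables, node.memtable)
```

Nothing is updated in place: the row for K1 ends up split between an SSTable and the memtable, which is why the read path has to merge them.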
Physical Volume #1 Physical Volume #2
K1 C3:V3
Commit Log
[Diagram: Mutation #1, #2, and #3 pass through the Commit Log Executor; the Commit Log Allocator provides Segment #1, #2, and #3 in memory, and each segment is backed by a Commit Log File on disk]
Flush! Write!
commitlog_segment_size_in_mb
Commit Log
• commitlog_sync
  1. periodic (default)
     • commitlog_sync_period_in_ms (default: 10 seconds)
  2. batch
     • commitlog_batch_window_in_ms
Memtable
• ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows;
• AtomicSortedColumns.Holder
  – DeletionInfo deletionInfo; // tombstone
  – SnapTreeMap<ByteBuffer, Column> map;
• Goals
  – Fast operations
  – Fast concurrent access
  – Fast in-order iteration
  – Atomic/Isolated operations within a column family
Skip List
[Diagram: skip list over keys 1 2 3 4 5 6 7; each of the four levels ends at NIL]
Skip List: Get 7
[Diagram: the search for 7 over keys 1 2 3 4 5 6 7]
Skip List: Delete 4
[Diagram: 4 is unlinked, leaving 1 2 3 5 6 7]
Skip List: Insert 4
[Diagram: 4 is linked back in, giving 1 2 3 4 5 6 7]
Skip List
ConcurrentSkipListMap uses: p = 0.5
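A single-threaded skip list with the same promotion probability is short enough to show whole. This is a sketch for illustration only: it is not the ConcurrentSkipListMap algorithm, and `SkipList`, `MAX_HEIGHT`, and the seed parameter are assumptions.

```python
import random

class SkipNode:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # one forward pointer per level

class SkipList:
    MAX_HEIGHT = 16
    P = 0.5  # chance of promoting a node one more level

    def __init__(self, seed=None):
        self.head = SkipNode(None, self.MAX_HEIGHT)
        self.rng = random.Random(seed)

    def _random_height(self):
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < self.P:
            height += 1
        return height

    def _predecessors(self, key):
        """Last node before `key` on every level, found top-down."""
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]
            preds[level] = node
        return preds

    def contains(self, key):
        candidate = self._predecessors(key)[0].next[0]
        return candidate is not None and candidate.key == key

    def insert(self, key):
        preds = self._predecessors(key)
        candidate = preds[0].next[0]
        if candidate is not None and candidate.key == key:
            return  # already present
        node = SkipNode(key, self._random_height())
        for level in range(len(node.next)):
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node

    def delete(self, key):
        preds = self._predecessors(key)
        target = preds[0].next[0]
        if target is None or target.key != key:
            return
        for level in range(len(target.next)):
            preds[level].next[level] = target.next[level]

    def keys(self):
        out, node = [], self.head.next[0]
        while node is not None:
            out.append(node.key)
            node = node.next[0]
        return out

sl = SkipList(seed=1)
for k in range(1, 8):
    sl.insert(k)
sl.delete(4)
print(sl.keys())
```

With p = 0.5 roughly half the nodes reach level 2, a quarter reach level 3, and so on, which gives the expected logarithmic search without any rebalancing.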
Skip List
[Diagram: list H 1 3 T; the new node 2 is pointed at 3, then a CAS swings 1's next pointer to 2]
Skip List
while true:
    next = current.next
    new_node.next = next
    if CompareAndSwap(current.next, next, new_node):
        break
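The retry loop can be run as real code if the CAS primitive is simulated. Python has no hardware compare-and-swap, so in this sketch a small `AtomicRef` class (hypothetical) uses a lock purely to make the compare-and-set step atomic; `ListNode`, `insert_after`, and `keys_of` are also illustration names.

```python
import threading

class AtomicRef:
    """Stand-in for an atomic reference; the lock only makes CAS atomic here."""
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class ListNode:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = AtomicRef(next_node)

def insert_after(current, new_node):
    while True:
        nxt = current.next.get()
        new_node.next = AtomicRef(nxt)   # new node is still private, safe to set
        if current.next.compare_and_swap(nxt, new_node):
            break                        # published; otherwise someone raced us, retry

def keys_of(head):
    out, node = [], head
    while node is not None:
        out.append(node.key)
        node = node.next.get()
    return out

tail = ListNode("T")
three = ListNode(3, tail)
one = ListNode(1, three)
head = ListNode("H", one)
insert_after(one, ListNode(2))
print(keys_of(head))
```

If another thread changes `current.next` between the read and the CAS, the CAS fails and the loop simply reads the new successor and tries again, so no insert is lost.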
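Skip List
[Diagram: list H 1 3 T; a CAS links node 2 into the list, but the node ends up unreachable: "I’m lost!"]
Skip List
[Diagram: list H 1 3 T updated in successive steps, each using a CAS]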
Skip List: Insert 4
[Diagram: skip list over keys 1 2 3 5 6 7 8 with four levels ending at NIL; the new node 4 is linked in with a CAS]
Skip List: Delete 4
[Diagram: skip list over keys 1 2 3 4 5 6 7; the entry for 4 is first replaced by NIL (1 2 3 NIL 5 6 7), then unlinked with a CAS, leaving 1 2 3 5 6 7]
SnapTree
3
2 5
1 4 6
Node Balance Factor
1 0
2 1
3 0
4 0
5 0
6 0
Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)
SnapTree
5
2 6
1 3
4
Node Balance Factor
1 0
2 -1
3 -1
4 0
5 2
6 0
Balance Factor must be -1, 0 or +1
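The two balance-factor tables can be reproduced by computing subtree heights directly. This sketch assumes node 4 is the right child of node 3 in the unbalanced tree, which follows from binary-search-tree ordering; `TreeNode`, `height`, and `balance_factors` are illustration names.

```python
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(node):
    """Number of nodes on the longest path down from `node` (0 for an empty subtree)."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def balance_factors(node, out=None):
    """Balance Factor = Height(Left-Subtree) - Height(Right-Subtree), per key."""
    out = {} if out is None else out
    if node is not None:
        out[node.key] = height(node.left) - height(node.right)
        balance_factors(node.left, out)
        balance_factors(node.right, out)
    return out

balanced = TreeNode(3, TreeNode(2, TreeNode(1)), TreeNode(5, TreeNode(4), TreeNode(6)))
unbalanced = TreeNode(5, TreeNode(2, TreeNode(1), TreeNode(3, None, TreeNode(4))), TreeNode(6))

print(balance_factors(balanced))
print(balance_factors(unbalanced))
```

Node 5 in the second tree comes out at 2, outside the allowed range of -1, 0, or +1, so that tree needs rebalancing.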
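SnapTree
Left-Right Case: root 5 with left child 3, and 3 with right child 4 (subtrees A, B, C, D)
Left-Left Case: root 5 with left child 4, and 4 with left child 3 (subtrees A, B, C, D)
Balanced: root 4 with children 3 and 5 (subtrees A B C D)
SnapTree
Right-Left Case: root 3 with right child 5, and 5 with left child 4 (subtrees A, B, C, D)
Right-Right Case: root 3 with right child 4, and 4 with right child 5 (subtrees A, B, C, D)
Balanced: root 4 with children 3 and 5 (subtrees A B C D)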
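The four cases reduce to two rotation primitives. The following is a plain AVL rebalancing sketch on a single subtree, not SnapTree's concurrent version; all names are chosen for illustration.

```python
class TreeNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def height(node):
    return 0 if node is None else 1 + max(height(node.left), height(node.right))

def balance(node):
    return height(node.left) - height(node.right)

def rotate_left(node):
    pivot = node.right
    node.right, pivot.left = pivot.left, node
    return pivot

def rotate_right(node):
    pivot = node.left
    node.left, pivot.right = pivot.right, node
    return pivot

def rebalance(node):
    """Return the new subtree root after fixing a balance factor of +2 or -2."""
    factor = balance(node)
    if factor > 1:
        if balance(node.left) < 0:        # Left-Right case: reduce to Left-Left
            node.left = rotate_left(node.left)
        return rotate_right(node)         # Left-Left case
    if factor < -1:
        if balance(node.right) > 0:       # Right-Left case: reduce to Right-Right
            node.right = rotate_right(node.right)
        return rotate_left(node)          # Right-Right case
    return node

left_right = TreeNode(5, TreeNode(3, None, TreeNode(4)))
root = rebalance(left_right)
print(root.key, root.left.key, root.right.key)
```

Both double-rotation cases first turn into their single-rotation counterpart, which is why the slides show each pair ending in the same balanced tree.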
SnapTree
5
2 6
1 3
4
Node Balance Factor
1 0
2 1
3 1
4 0
5 2
6 0
5
2
6
1
3
4
Node Balance Factor
1 0
2 -1
3 -1
4 0
5 2
6 0
SnapTree
Node Balance Factor
1 0
2 1
3 1
4 0
5 2
6 0
5
2
6
1
3
4
Node Balance Factor
1 0
2 1
3 0
4 0
5 0
6 0
3
2 5
1 4 6
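Epoch
SnapTree
[Diagram: tree under Root with 5 at the top, children 2 and 6, then 1 and 3, then 4. Insert 4 notes "Version(5) is 0" and "Version(2) is 0" on the way down, takes a Lock, and checks: Does Version(5) == 0?]
Epoch
SnapTree
[Diagram: Get 4 on the same tree notes "Version(5) is 0" and "Version(2) is 0", then checks: Does Version(5) == 0?]
Epoch
SnapTree
[Diagram: the tree under Root has changed shape while Get 4 was running. Does Version(5) == 0? NO! Go back to 5]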
Epoch
SnapTree
[Diagram: tree under Root with 3 at the top, children 2 and 5, then 1, 4, and 6. Delete 3 has to take a Lock :( and then calls SetValue(3, null)]
SnapTree
[Diagram: Epoch #1 holds Root for the tree 3; 2 5; 1 4 6. A Clone request puts a Stop on Delete and Insert]
SnapTree
[Diagram: after the Clone, Epoch #2 and Epoch #3 each have a Root, and both point at the same nodes: "I’m shared!"]
SnapTree
[Diagram: Insert 7 in Epoch #3. The shared nodes 3; 2 5; 1 4 6 are copied for Epoch #3 and 7 is added to the copy, while Epoch #2 keeps the original tree]
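The idea of two roots sharing nodes until one of them writes can be sketched with path copying. This is a deliberate simplification and an assumption on my part: SnapTree's own mechanism is epoch-based and handles concurrency and rebalancing, none of which appears here, and `CowNode`, `insert`, and `keys` are hypothetical names.

```python
class CowNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(node, key):
    """Return a new root; nodes on the search path are copied, the rest are shared."""
    if node is None:
        return CowNode(key)
    if key < node.key:
        return CowNode(node.key, insert(node.left, key), node.right)
    if key > node.key:
        return CowNode(node.key, node.left, insert(node.right, key))
    return node

def keys(node):
    return [] if node is None else keys(node.left) + [node.key] + keys(node.right)

epoch2_root = None
for k in (3, 2, 5, 1, 4, 6):
    epoch2_root = insert(epoch2_root, k)

epoch3_root = epoch2_root             # clone: O(1), every node is shared
epoch3_root = insert(epoch3_root, 7)  # write in the clone copies only what it touches

print(keys(epoch2_root))
print(keys(epoch3_root))
```

The snapshot held by Epoch #2 never sees 7, and the untouched left subtree is still the same object in both trees, so taking the clone costs nothing until a write happens.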
Snap Tree
C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307
SSTable
Filter.db
Data.db: K1, K2, K3 with columns C1, C1, C2, C2, C3
CRC.db: 0xFFCC23ED, 0x1FEA2321, 0xCE652133
Index.db: K1, K2, K3 with offsets 00001, 00002, 00003
CompressionInfo.db: 00001, 00002, 00003 and 00001, 00004, 00006
Compression? No / Yes
• CASSANDRA-2319
• Promote row index
• CASSANDRA-4885
• Remove … per-row bloom filters
Delete
• Essentially a write (mutation)
• Data not removed immediately, but a tombstone record added
• tombstone time > gc_grace = data removed (compaction)
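The three bullets above can be sketched as a tombstone write plus a purge rule applied during compaction. The dictionary layout, the timestamps in seconds, and the helper names are assumptions for illustration; 864000 seconds (10 days) is used as the example grace period.

```python
TOMBSTONE = object()  # marker stored in place of the value

def delete(table, key, now):
    """A delete is just another write: it records a tombstone with a timestamp."""
    table[key] = (now, TOMBSTONE)

def compact(table, now, gc_grace_seconds):
    """Drop tombstones older than the grace period; keep everything else."""
    return {k: (ts, v) for k, (ts, v) in table.items()
            if not (v is TOMBSTONE and now - ts > gc_grace_seconds)}

table = {"K1": (100, "V1"), "K2": (100, "V2")}
delete(table, "K1", now=200)

early = compact(table, now=300, gc_grace_seconds=864000)
late = compact(table, now=200 + 864001, gc_grace_seconds=864000)
print(sorted(early), sorted(late))
```

Keeping the tombstone for the grace period gives replicas that missed the delete time to learn about it before the marker disappears.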
Bloom Filter
[Diagram: a 16-bit array; inserting K1 hashes the key several times and sets one bit per hash]
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
hash = murmur3(key)  # creates two hashes
for i in range(hash_count):
    result[i] = abs((hash[0] + i * hash[1]) % num_keys)
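A runnable version of the double-hashing scheme might look like this. It is a sketch under stated assumptions: the two base hashes come from a SHA-256 digest instead of murmur3 (to stay within the standard library), and the modulus is the number of bits in the filter; `BloomFilter` and its method names are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [0] * num_bits

    def _buckets(self, key):
        # Two base hashes; every further hash is h0 + i * h1.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h0 = int.from_bytes(digest[:8], "big")
        h1 = int.from_bytes(digest[8:16], "big")
        return [(h0 + i * h1) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for bucket in self._buckets(key):
            self.bits[bucket] = 1

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[bucket] for bucket in self._buckets(key))

bf = BloomFilter(num_bits=16, num_hashes=3)
bf.add("K1")
print(bf.bits, bf.might_contain("K1"))
```

A filter never gives a false negative, so a "no" lets the read path skip an SSTable without touching disk.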
Bloom Filter
Bloom Filter Probability Calculation
Config: bloom_filter_fp_chance, and
SSTable: number of rows
Num hashes, and
Num bits per entry
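The relationship between these quantities follows the textbook Bloom filter formulas, sketched below. Cassandra's own calculation may differ in detail (for example by using precomputed tables), so treat this as an approximation; the function names are hypothetical.

```python
import math

def bloom_parameters(fp_chance):
    """Textbook optimum: bits per entry and hash count for a target false-positive rate."""
    bits_per_entry = -math.log(fp_chance) / (math.log(2) ** 2)
    num_hashes = max(1, round(bits_per_entry * math.log(2)))
    return bits_per_entry, num_hashes

def total_bits(fp_chance, num_rows):
    """Filter size for an SSTable with `num_rows` rows."""
    return math.ceil(bloom_parameters(fp_chance)[0] * num_rows)

bits, hashes = bloom_parameters(0.01)
print(f"{bits:.2f} bits per entry, {hashes} hashes, {total_bits(0.01, 1000)} bits for 1000 rows")
```

Lowering bloom_filter_fp_chance buys fewer wasted disk reads at the cost of more memory per row.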
Read
[Diagram: the columns of K1 are spread across memory and disk. Memtables hold C4:V4 and C5:V5, SSTable #2 holds C3:V3, and SSTable #1 holds C1:V1 C2:V2. The read merges them into one row: K1 C1:V1 C2:V2 C3:V3 C4:V4 C5:V5]
Row Cache
= Off-heap
row_cache_size_in_mb > 0
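The merge step of the read can be sketched as "collect every fragment of the row and keep the newest cell per column". The timestamps and the `read_row` helper are hypothetical, and the sketch ignores tombstones, caches, and Bloom filters.

```python
def read_row(key, sources):
    """sources: iterable of {key: {column: (timestamp, value)}} (memtables and SSTables)."""
    merged = {}
    for source in sources:
        for column, cell in source.get(key, {}).items():
            # Keep the cell with the highest write timestamp for each column.
            if column not in merged or cell[0] > merged[column][0]:
                merged[column] = cell
    return {column: value for column, (ts, value) in sorted(merged.items())}

sstable1 = {"K1": {"C1": (1, "V1"), "C2": (2, "V2")}}
sstable2 = {"K1": {"C3": (3, "V3")}}
memtable = {"K1": {"C4": (4, "V4"), "C5": (5, "V5")}}

print(read_row("K1", [memtable, sstable2, sstable1]))
```

The more SSTables a row is spread over, the more sources a read has to consult, which is one reason compaction matters for read performance.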
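Read
[Diagram: per-SSTable read path through Bloom Filter, Key Cache, Partition Summary, Partition Index, Compression Offsets, and Data. On a Cache Hit the read skips ahead to Compression Offsets; on a Cache Miss it goes through Partition Summary and Partition Index first]
= Off-heap
key_cache_size_in_mb > 0
index_interval = 128 (default)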
Compaction (Size-tiered)
min_compaction_threshold = 4
[Diagram: each Memtable flush adds an SSTable; once min_compaction_threshold SSTables of similar size exist, they are merged into one larger SSTable]
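A toy model of the size-tiered rule makes the tiers visible. It assumes SSTables count as "similar" only when their sizes are equal and that merging adds sizes with no overwritten data, both simplifications; `compact_once` and `flush` are hypothetical helpers.

```python
from collections import defaultdict

def compact_once(sizes, min_threshold=4):
    """Merge the smallest group of at least `min_threshold` equal-sized SSTables, if any."""
    buckets = defaultdict(list)
    for size in sizes:
        buckets[size].append(size)
    for size, members in sorted(buckets.items()):
        if len(members) >= min_threshold:
            remaining = list(sizes)
            for member in members:
                remaining.remove(member)
            return sorted(remaining + [sum(members)])
    return sorted(sizes)

def flush(sizes, memtable_mb, min_threshold=4):
    """Add a freshly flushed SSTable, then compact until nothing qualifies."""
    sizes = sizes + [memtable_mb]
    while True:
        compacted = compact_once(sizes, min_threshold)
        if compacted == sorted(sizes):
            return compacted
        sizes = compacted

sstables = []
for _ in range(16):
    sstables = flush(sstables, 10)
print(sstables)
```

Four 10 MB flushes become one 40 MB SSTable, and four of those become one 160 MB SSTable, so sizes grow in tiers and each row is rewritten only a logarithmic number of times.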
Compaction (Leveled)
[Diagram: Memtable flushes produce SSTables in L0, which are compacted into L1, then L2, and so on]
sstable_size_in_mb = 160
L0: 160 MB, L1: 160 MB x 10, L2: 160 MB x 100, …
Topics
• CAS (Paxos)
• Anti-entropy (Merkle trees)
• Gossip (Failure detection)
Thanks