A Deep Dive into Understanding Apache Cassandra

Inside Cassandra

Michael Penick

Overview

• To disk and back again
• Cassandra Internals by Aaron Morton
• Goals
– RDBMS comparison to C*
– Make educated decisions

I’m configuration

Distributed Hashing

[Figure: keys A through P spread across Node 0, Node 1, Node 2, and Node 3]

Location = Hash(Key) % # Nodes

Distributed Hashing

[Figure: keys A through P redistributed across Node 0 through Node 4 after Node 4 is added]

% Data Moved = 100 * N / (N + 1)
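The two formulas for distributed hashing can be checked with a small experiment. This is only a sketch: MD5 stands in for the hash function, and the names location and percent_moved are made up for illustration.

```python
import hashlib


def location(key: str, num_nodes: int) -> int:
    """Location = Hash(Key) % # Nodes, with MD5 as a stand-in hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def percent_moved(keys, num_nodes: int) -> float:
    """Percentage of keys whose node changes when one node is added."""
    moved = sum(
        1 for k in keys if location(k, num_nodes) != location(k, num_nodes + 1)
    )
    return 100 * moved / len(keys)


keys = [f"key-{i}" for i in range(10_000)]
n = 4
print(f"measured:  {percent_moved(keys, n):.1f}%")
print(f"predicted: {100 * n / (n + 1):.1f}%")
```

With four nodes the prediction is 80%, so growing the cluster by one node relocates most of the data.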

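Consistent Hashing

[Figure: token ring starting at 0 with Node 1, Node 2, Node 3, and Node 4]

Consistent Hashing

[Figure: keys A through P placed on the ring in four groups: A E I M, B F J N, C G K O, D H L P]

Add Node 0

[Figure: the same ring and key groups after Node 0 is added]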

% Data Moved = 100 * 1 / N
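A minimal sketch of a consistent hashing ring, assuming one token per node and MD5 as a stand-in hash (the Ring class and its method names are hypothetical). It shows that when a node joins, the only keys that move are the ones the new node takes over.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes):
        self._entries = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        """First node clockwise from the key's token, wrapping past the end."""
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._entries[i][1]


keys = [f"key-{i}" for i in range(10_000)]
before = Ring(["node1", "node2", "node3", "node4"])
after = Ring(["node0", "node1", "node2", "node3", "node4"])

moved = [k for k in keys if before.owner(k) != after.owner(k)]
# Every key that moved now belongs to the new node; all other keys stayed put.
print(all(after.owner(k) == "node0" for k in moved))
print(f"{100 * len(moved) / len(keys):.1f}% of keys moved")
```

With a single token per node the share that moves depends on where the new token lands, which is one motivation for the virtual nodes covered next.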

Virtual Nodes

Found: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

num_tokens

initial_token

Tunable Consistency

[Figure: ring of Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]

replication_factor = 3

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE

Tunable Consistency

[Figure: ring of Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]

replication_factor = 3

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
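How many replicas must acknowledge a write at each level can be sketched as below. A quorum is a majority of the replicas, floor(replication_factor / 2) + 1. The function names are hypothetical, and ALL is included for comparison even though it does not appear on these slides.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level: str, replication_factor: int) -> int:
    """Replica acknowledgements needed before the write is reported successful."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,  # not shown on the slides; added for comparison
    }
    return levels[level]


for level in ("ONE", "QUORUM", "ALL"):
    print(level, required_acks(level, replication_factor=3))
```

With replication_factor = 3, QUORUM needs two of R1, R2, and R3, so one replica can be down without failing the write.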

Hinted Handoff

[Figure: ring of Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]

replication_factor = 3 and hinted_handoff_enabled = true

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY

Write locally: system.hints

Note: A hint does not count toward the consistency level (except ANY)

Tunable Consistency

[Figure: ring of Node 1 through Node 6; the client's write goes to replicas R1, R2, and R3]

INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY EACH_QUORUM

[Figure: ring of Node 1 through Node 6 with replicas R1 and R2]

Appends FWD_TO parameter to message

Read Repair

[Figure: ring of Node 1 through Node 6; the client's read involves replicas R1, R2, and R3]

SELECT * FROM table USING CONSISTENCY ONE

replication_factor = 3 and read_repair_chance > 0

Write

[Figure: write path. Memory: Memtable holding K1 C1:V1 C2:V2. Disk: Commit Log holding K1 C1:V1 C2:V2, and SSTable #1 holding K1 C1:V1 C2:V2 after the flush]

Flush when: > commitlog_total_space_in_mb or > memtable_total_space_in_mb

Write

[Figure: write path. Memory: Memtable holding K1 C3:V3. Disk: Commit Log holding K1 C3:V3, SSTable #1 holding K1 C1:V1 C2:V2, and SSTable #2]

Note: All writes are sequential!
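The write path on these slides can be sketched as a toy in-memory model. The Table class and its memtable_limit threshold (counted in rows) are made up for illustration; the real flush triggers are the space settings named above.

```python
class Table:
    """Toy write path: commit log append, memtable update, flush to an SSTable."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # key -> {column: value}, held in memory
        self.sstables = []     # immutable snapshots; stand in for on-disk files
        self.memtable_limit = memtable_limit  # hypothetical threshold, in rows

    def write(self, key, columns):
        self.commit_log.append((key, dict(columns)))        # 1. sequential append
        self.memtable.setdefault(key, {}).update(columns)   # 2. in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out in key order as a new SSTable, never in place.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


table = Table()
table.write("K1", {"C1": "V1", "C2": "V2"})
table.write("K2", {"C1": "V1"})   # second row reaches the limit and flushes
table.write("K1", {"C3": "V3"})   # lands in a fresh memtable
print(table.sstables)
print(table.memtable)
```

Nothing is ever updated in place: the log is appended to and each flush writes a new file, which is why all writes are sequential.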

Physical Volume #1 Physical Volume #2

K1 C3:V3

Commit Log

[Figure: Memory: Mutation #1, Mutation #2, and Mutation #3 queue for the Commit Log Executor; the Commit Log Allocator supplies Segment #1, Segment #2, and Segment #3. Disk: Commit Log Files. Segments are written and flushed to the files]

commitlog_segment_size_in_mb

Commit Log

• commitlog_sync
1. periodic (default)
• commitlog_sync_period_in_ms (default: 10 seconds)
2. batch
• commitlog_batch_window_in_ms

Memtable

• ConcurrentSkipListMap<RowPosition, AtomicSortedColumns> rows;
• AtomicSortedColumns.Holder
– DeletionInfo deletionInfo; // tombstone
– SnapTreeMap<ByteBuffer, Column> map;
• Goals
– Fast operations
– Fast concurrent access
– Fast in-order iteration
– Atomic/Isolated operations within a column family

Skip List

[Figure: skip list over keys 1 2 3 4 5 6 7 with four levels, each ending in NIL]

Skip List

Get 7

[Figure: searching the skip list 1 2 3 4 5 6 7 for key 7]

Skip List

Delete 4

[Figure, two frames: the list before (1 2 3 4 5 6 7) and after (1 2 3 5 6 7) key 4 is removed]

Skip List

Insert 4

[Figure, two frames: the list before (1 2 3 5 6 7) and after (1 2 3 4 5 6 7) key 4 is inserted]

Skip List

ConcurrentSkipListMap uses: p = 0.5
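The Get, Delete, and Insert operations from the previous slides can be sketched as a single-threaded skip list. This is a textbook version for illustration, not the java.util.concurrent code; the class and method names are made up, and p is the chance of promoting a new node one level higher.

```python
import random


class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.head = Node(None, self.MAX_LEVEL)  # sentinel; None plays the role of NIL
        self.level = 1

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < self.p:
            level += 1
        return level

    def _predecessors(self, key):
        """Last node before `key` on every level, found top-down."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.level)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def contains(self, key):
        nxt = self._predecessors(key)[0].forward[0]
        return nxt is not None and nxt.key == key

    def insert(self, key):
        update = self._predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            return  # already present
        level = self._random_level()
        self.level = max(self.level, level)
        node = Node(key, level)
        for i in range(level):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def delete(self, key):
        update = self._predecessors(key)
        target = update[0].forward[0]
        if target is None or target.key != key:
            return
        for i in range(len(target.forward)):
            update[i].forward[i] = target.forward[i]  # unlink on every level

    def keys(self):
        node, out = self.head.forward[0], []
        while node is not None:
            out.append(node.key)
            node = node.forward[0]
        return out


sl = SkipList(p=0.5, seed=1)
for k in [1, 2, 3, 4, 5, 6, 7]:
    sl.insert(k)
sl.delete(4)
print(sl.keys())  # [1, 2, 3, 5, 6, 7]
sl.insert(4)
print(sl.contains(7), sl.keys())
```

With p = 0.5 each level holds about half the nodes of the level below it, which gives the expected O(log n) search.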

Skip List

Insert 4

[Figure: skip list 1 2 3 4 5 6 7 with key 4 inserted; each level ends in NIL]

Skip List

[Figure, two frames: list H 1 3 T with new node 2; node 2 is linked between 1 and 3 with a CAS]

Skip List

while(true):
    next = current.next
    new_node.next = next
    if(CompareAndSwap(current.next, next, new_node)):
        break
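A runnable sketch of the retry loop above. Python has no built-in compare-and-swap on object fields, so the cas_next method emulates one with a lock; that emulation and all names here are assumptions made for illustration. The list only supports insertion, so the deletion races shown on the following slides do not arise.

```python
import threading


class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node
        self._lock = threading.Lock()  # only used to emulate an atomic CAS

    def cas_next(self, expected, new):
        """Set self.next to `new` only if it is still `expected`."""
        with self._lock:
            if self.next is expected:
                self.next = new
                return True
            return False


def insert_sorted(head, key):
    new_node = Node(key)
    while True:
        # Find the insertion point, remembering the exact successor we saw.
        current, nxt = head, head.next
        while nxt is not None and nxt.key < key:
            current, nxt = nxt, nxt.next
        new_node.next = nxt
        if current.cas_next(nxt, new_node):
            break  # linked in; otherwise another thread changed the link, so retry


head = Node(float("-inf"))  # H sentinel; None plays the role of T
threads = [threading.Thread(target=insert_sorted, args=(head, k)) for k in (3, 1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

node, keys = head.next, []
while node is not None:
    keys.append(node.key)
    node = node.next
print(keys)  # [1, 2, 3]
```

A failed CAS means another thread changed current.next in the meantime, so the loop searches again rather than overwriting that thread's node.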

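Skip List

[Figure: two concurrent CAS operations on the list H 1 3 T; node 2 ends up unreachable ("I’m lost!")]

Skip List

[Figure, three frames: the list H 1 3 T updated in two CAS steps]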

Skip List

Insert 4

[Figure, three frames: node 4 is created beside the skip list 1 2 3 5 6 7 8, linked in with a CAS, and then part of the list; each level ends in NIL]

Skip List

Delete 4

[Figure, five frames: node 4 in the skip list 1 2 3 4 5 6 7 has its value set to NIL (1 2 3 NIL 5 6 7), is unlinked with a CAS, and the list becomes 1 2 3 5 6 7]

SnapTree

[Figure: tree with root 3; children 2 and 5; leaves 1, 4, and 6]

Node  Balance Factor
1     0
2     1
3     0
4     0
5     0
6     0

Balance Factor = Height(Left-Subtree) – Height(Right-Subtree)
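The balance factor formula can be checked against the table with a few lines of code. This is a plain binary tree sketch, not SnapTree itself; the Node class and function names are made up for illustration.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def height(node):
    """Height in nodes: an empty subtree is 0, a leaf is 1."""
    return 0 if node is None else 1 + max(height(node.left), height(node.right))


def balance_factor(node):
    return height(node.left) - height(node.right)


def balance_factors(node, out=None):
    """Collect {key: balance factor} for every node, in key order."""
    out = {} if out is None else out
    if node is not None:
        balance_factors(node.left, out)
        out[node.key] = balance_factor(node)
        balance_factors(node.right, out)
    return out


# The tree on this slide: root 3, children 2 and 5, leaves 1, 4, and 6.
balanced = Node(3, Node(2, Node(1)), Node(5, Node(4), Node(6)))
print(balance_factors(balanced))
```

Node 2 has a left child and no right child, so its factor is 1 - 0 = 1; every other node has subtrees of equal height.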

SnapTree

[Figure: tree with root 5; children 2 and 6; 2 has children 1 and 3; 3 has child 4]

Node  Balance Factor
1     0
2     -1
3     -1
4     0
5     2
6     0

Balance Factor must be -1, 0 or +1

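SnapTree

[Figure: rebalancing with rotations; A, B, C, and D are subtrees. Left-Right Case: 5 with left child 3, whose right child is 4. Left-Left Case: 5 with left child 4, whose left child is 3. Result: 4 with children 3 and 5]

Left-Right Case

Left-Left Case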

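SnapTree

[Figure: rebalancing with rotations; A, B, C, and D are subtrees. Right-Left Case: 3 with right child 5, whose left child is 4. Right-Right Case: 3 with right child 4, whose right child is 5. Result: 4 with children 3 and 5]

Right-Left Case

Right-Right Case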
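The four cases reduce to two rotations applied once or twice. The sketch below uses a plain, single-threaded binary tree with made-up names; SnapTree performs the same rotations under its own concurrency control.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def rotate_right(node):
    """Fix a Left-Left case: the left child becomes the new subtree root."""
    pivot = node.left
    node.left, pivot.right = pivot.right, node
    return pivot


def rotate_left(node):
    """Fix a Right-Right case: the right child becomes the new subtree root."""
    pivot = node.right
    node.right, pivot.left = pivot.left, node
    return pivot


def fix_left_right(node):
    node.left = rotate_left(node.left)   # Left-Right becomes Left-Left
    return rotate_right(node)


def fix_right_left(node):
    node.right = rotate_right(node.right)  # Right-Left becomes Right-Right
    return rotate_left(node)


left_right = Node(5, Node(3, None, Node(4)))
root = fix_left_right(left_right)
print(root.key, root.left.key, root.right.key)  # 4 3 5

right_left = Node(3, None, Node(5, Node(4)))
root2 = fix_right_left(right_left)
print(root2.key, root2.left.key, root2.right.key)  # 4 3 5
```

Both double-rotation cases end in the same shape as the slides: 4 at the root with 3 and 5 as children.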

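SnapTree

[Figure, two frames: the tree 5 / 2 6 / 1 3 / 4 with its balance factors]

Node  Balance Factor
1     0
2     1
3     1
4     0
5     2
6     0

Node  Balance Factor
1     0
2     -1
3     -1
4     0
5     2
6     0

SnapTree

[Figure, two frames: the tree 5 / 2 6 / 1 3 / 4 is rebalanced into the tree 3 / 2 5 / 1 4 6]

Node  Balance Factor
1     0
2     1
3     1
4     0
5     2
6     0

Node  Balance Factor
1     0
2     1
3     0
4     0
5     0
6     0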

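SnapTree

[Figure: Epoch and Root above the tree 5 / 2 6 / 1 3 / 4. Insert 4 reads node versions on the way down and then takes a Lock]

Version(5) is 0

Version(2) is 0

Does Version(5) == 0?

SnapTree

[Figure: Epoch and Root above the tree 5 / 2 6 / 1 3 / 4. Get 4 reads node versions on the way down]

Version(5) is 0

Version(2) is 0

SnapTree

[Figure: Epoch and Root above the tree 5 / 2 6 / 1 3 / 4 while it is being rearranged. Get 4 fails its version check]

NO! Go back to 5

SnapTree

[Figure, three frames: Epoch and Root above the tree 3 / 2 5 / 1 4 6. Delete 3 takes a Lock and calls SetValue(3, null)]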

SnapTree

[Figure: Epoch #1 with Root above the tree 3 / 2 5 / 1 4 6. Clone; Delete and Insert are stopped]

SnapTree

[Figure: after Clone, Epoch #2 and Epoch #3 each have a Root pointing at the same tree 3 / 2 5 / 1 4 6 ("I’m shared!")]

SnapTree

[Figure, three frames: Insert 7 under Epoch #3. The shared tree 3 / 2 5 / 1 4 6 is copied for Epoch #3 and 7 is added to the copy; the tree under Epoch #2 is unchanged]

Snap Tree

C* 2.0.0 - File: db/AtomicSortedColumns.java Line: 307

SSTable

[Figure: SSTable component files]

Filter.db
Data.db: rows K1, K2, K3 with columns C1, C2, C3
CRC.db: 0xFFCC23ED, 0x1FEA2321, 0xCE652133
Index.db: K1, K2, K3 mapped to 00001, 00002, 00003
CompressionInfo.db: 00001, 00002, 00003 mapped to 00001, 00004, 00006

Compression? No / Yes

• CASSANDRA-2319
– Promote row index
• CASSANDRA-4885
– Remove … per-row bloom filters

https://issues.apache.org/jira/browse/CASSANDRA-2319
Delete

• Essentially a write (mutation)
• Data is not removed immediately; a tombstone record is added instead
• Once the tombstone is older than gc_grace, the data is removed (during compaction)
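A toy sketch of the three points above. Cells are (value, written_at) pairs, a delete writes a tombstone marker, and compaction drops tombstones older than gc_grace. The data layout and function names are made up for illustration and the time units are arbitrary.

```python
TOMBSTONE = object()  # marker value written in place of the deleted data


def delete(memtable, key, now):
    """A delete is just another write: it records a tombstone."""
    memtable[key] = (TOMBSTONE, now)


def compact(sstables, now, gc_grace):
    """Merge SSTables (oldest first); drop tombstones older than gc_grace."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)  # newer entries shadow older ones
    return {
        key: (value, written_at)
        for key, (value, written_at) in merged.items()
        if not (value is TOMBSTONE and now - written_at > gc_grace)
    }


old = {"K1": ("V1", 0), "K2": ("V2", 0)}
new = {}
delete(new, "K1", now=100)

print(sorted(compact([old, new], now=150, gc_grace=100)))  # ['K1', 'K2']
print(sorted(compact([old, new], now=250, gc_grace=100)))  # ['K2']
```

In the first compaction K1 is still present, but only as a tombstone; in the second both the tombstone and the data it shadowed are gone.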

Bloom Filter

[Figure: empty 16-bit array 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0; K1 is hashed and inserted]

Bloom Filter

[Figure: after the first hash of K1: 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]

Bloom Filter

[Figure: after three hashes of K1: 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0]

hash = murmur3(key)  # creates two hashes
for i in range(hash_count):
    result[i] = abs(hash[0] + i * hash[1]) % num_bits
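A small runnable version of the same idea: two base hashes combined as h0 + i * h1 to derive every bit index. The slide uses murmur3; this sketch swaps in SHA-256 from the standard library, and the class and method names are made up for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indexes(self, key: str):
        # Two base hashes from one digest (SHA-256 stands in for murmur3).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h0 = int.from_bytes(digest[:8], "big")
        h1 = int.from_bytes(digest[8:16], "big")
        return [(h0 + i * h1) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for i in self._indexes(key):
            self.bits[i] = 1

    def might_contain(self, key: str) -> bool:
        """False means definitely absent; True means possibly present."""
        return all(self.bits[i] for i in self._indexes(key))


bf = BloomFilter()
bf.add("K1")
print(bf.bits)
print(bf.might_contain("K1"))  # True
```

A filter never gives a false negative, so an SSTable whose filter answers False for a key can be skipped without touching disk.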

Bloom Filter

Bloom Filter Probability Calculation

Inputs: bloom_filter_fp_chance (config) and number of rows (SSTable)

Outputs: number of hashes and number of bits per entry
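For a sense of scale, the textbook sizing formulas for an optimal Bloom filter are sketched below. This is an assumption for illustration: Cassandra's own calculation may differ from these closed-form values, and the function names are made up.

```python
import math


def bloom_parameters(fp_chance: float):
    """Textbook optimum: bits per entry and number of hash functions."""
    bits_per_entry = -math.log(fp_chance) / (math.log(2) ** 2)
    num_hashes = max(1, round(bits_per_entry * math.log(2)))
    return bits_per_entry, num_hashes


def total_bits(fp_chance: float, num_rows: int) -> int:
    """Filter size for an SSTable holding num_rows rows."""
    return math.ceil(bloom_parameters(fp_chance)[0] * num_rows)


bits, hashes = bloom_parameters(0.01)
print(f"{bits:.2f} bits per entry, {hashes} hashes")
print(total_bits(0.01, 1_000_000), "bits for one million rows")
```

Lowering bloom_filter_fp_chance buys fewer wasted disk reads at the cost of more memory per row.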

Read

[Figure: read path. Memory: Memtables holding K1 C4:V4 and K1 C5:V5. Disk: SSTable #2 holding K1 C3:V3 and SSTable #1 holding K1 C1:V1 C2:V2. The merged row K1 C1:V1 C2:V2 C3:V3 C4:V4 C5:V5 is kept in the Row Cache]

Row Cache: off-heap, enabled when row_cache_size_in_mb > 0
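The merge on this slide can be sketched as below. A row's columns are collected from every SSTable and from the memtable, and the merged row is cached. This toy version lets newer sources overwrite older ones by order; the real system resolves conflicts per column by write timestamp. All names are made up for illustration.

```python
def read_row(key, memtable, sstables):
    """Merge a row's columns from SSTables (oldest first), then the memtable."""
    row = {}
    for sstable in sstables:
        row.update(sstable.get(key, {}))
    row.update(memtable.get(key, {}))
    return row


row_cache = {}


def cached_read(key, memtable, sstables):
    if key not in row_cache:  # cache miss: do the full merge once
        row_cache[key] = read_row(key, memtable, sstables)
    return row_cache[key]


sstable_1 = {"K1": {"C1": "V1", "C2": "V2"}}
sstable_2 = {"K1": {"C3": "V3"}}
memtable = {"K1": {"C4": "V4"}}

print(cached_read("K1", memtable, [sstable_1, sstable_2]))
```

A real row cache also has to invalidate or update the entry when the row is written again; that step is left out here.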

Read

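[Figure: read path within one SSTable. Memory: Bloom Filter, Key Cache, Partition Summary, Compression Offsets. Disk: Partition Index, Data. Arrows are labeled Cache Hit and Cache Miss; the legend marks the off-heap structures]

key_cache_size_in_mb > 0

index_interval = 128 (default)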

Compaction (Size-tiered)


min_compaction_threshold = 4

Memtable flush!

Compaction (Leveled)

Memtable flush!

sstable_size_in_mb = 160

L0: 160 MB
L1: 160 MB x 10
L2: 160 MB x 100
…
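The level sizes on the slide follow a factor-of-ten pattern, sketched here with a hypothetical helper named level_capacity_mb.

```python
def level_capacity_mb(level: int, sstable_size_in_mb: int = 160) -> int:
    """Size of a level as drawn on the slide: sstable_size_in_mb x 10^level."""
    return sstable_size_in_mb * 10 ** level


for level in range(3):
    print(f"L{level}: {level_capacity_mb(level)} MB")
```

Each level holds ten times as much data as the one before it, so a handful of levels covers a very large data set.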

Topics

• CAS (Paxos)
• Anti-entropy (Merkle trees)
• Gossip (Failure detection)

Thanks
