Cassandra
Scalable Distributed Data Store
Sylvain Lebresne ([email protected])
A few dates
• Created by Facebook (around 2007)
• Open-sourced in 2008
• Became an ASF incubator project in January 2009
• Graduated as a top-level ASF project in February 2010
• 3 major releases (current is 0.6)
• In production at multiple companies (Facebook, Digg, Twitter, Reddit, Rackspace, etc.), with the largest cluster spanning over 150 machines
Cassandra = Big Table + Dynamo
• From Dynamo: the distribution model
• From Big Table: the data model and storage architecture
Why Cassandra?
• Fully distributed (client can connect to any node)
• No single point of failure
• Incremental scalability
• Richer data model than simple key-value
• Data center aware
• Fast reads, faster writes (“optimize for reads, writes are cheap”)
• Always writable
• Eventually consistent
Eventual Consistency
• Isn’t consistency important?
• Yes, hence eventual consistency:
• Not: “Let’s not be consistent”
• But rather: “Instead of designing (costly) measures to prevent inconsistency, we acknowledge that the cluster may be in an inconsistent state for a brief period of time, and we deal with it”
• Moreover, Cassandra lets the client choose a trade-off between consistency and latency
Data Model
• A distributed multi-level hash map:
Keyspace → Column Family → row key → [super column key →] column key → value
• Keyspace: one per application
• Column Family: a few
• row key: millions are fine, but limited by node capacity
• column key: millions are fine, but limited by node capacity
• super column key: in the current implementation, not too many
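The multi-level map can be sketched as nested Python dicts (all names below are illustrative, not a real schema):

```python
# A toy model of Cassandra's multi-level map: keyspace -> column family
# -> row key -> (super column key ->) column key -> value.
keyspace = {
    "UserProfiles": {                  # column family (a few per keyspace)
        "ident42": {                   # row key (millions are fine)
            "fname": "John",           # column key -> value
            "lname": "Doe",
        },
    },
    "Inbox": {                         # a *super* column family
        "ident42": {
            "2010-04": {               # super column key (not too many)
                "1983483": "msg11",    # column key -> value
            },
        },
    },
}

# A look-up simply walks the levels of the map:
print(keyspace["UserProfiles"]["ident42"]["fname"])  # John
```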
Column
• The basic unit: a (name, value, timestamp) triple.
Columns
“fname” → “John”
“lname” → “Doe”
“phone” → 0612346789
“age” → 35
“picture” → 0x0FC3...
Row
• A row key and its columns, e.g. the columns above under the key ident42.
Rows
ident42 → { “fname”: “John”, “lname”: “Doe”, “phone”: 0612346789, “age”: 35, “picture”: 0x0FC3... }
ident123 → { “fname”: “Chuck”, “lname”: “Norris”, “age”: 70, “email”: “[email protected]”, “picture”: 0x159A... }
ident24 → { “fname”: “Sylvain”, “lname”: “Lebresne”, “phone”: 0698765432, “age”: 29 }
Column Family
• A set of rows; here, column names are timestamps and values are messages:
ident24 → { 1983483: msg11, 1983490: msg12, 1983512: msg13, 1983618: msg14 }
ident123 → { 1990310: msg21, 1991321: msg22, 2015672: msg23 }
ident42 → { 1847820: msg31, 1848923: msg32, 1848924: msg33, 1983618: msg34 }
(Super) Column Family
• The same, with super columns adding one extra level of nesting between the row key and the columns.
Comparison
• Columns and super columns are sorted by name.
• The sort order is customizable and defined per column family.
• Predefined comparators are:
• BytesType
• LongType
• AsciiType
• UTF8Type
• LexicalUUIDType
• TimeUUIDType
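Why the comparator matters: the same column names sort differently under different types. A small sketch contrasting byte order with numeric order (the decoding below is illustrative; LongType actually compares 8-byte big-endian values):

```python
# Columns in a column family are kept sorted by the CF's comparator.
columns = {b"10": "a", b"2": "b", b"1": "c"}

# BytesType-style: lexicographic byte order
bytes_order = sorted(columns)
# LongType-style: numeric order on the decoded value
long_order = sorted(columns, key=lambda name: int(name))

print(bytes_order)  # [b'1', b'10', b'2']
print(long_order)   # [b'1', b'2', b'10']
```

With BytesType, b"10" sorts before b"2"; a numeric comparator restores the order a range query over timestamps or counters would expect.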
API
• Writes:
• insert() : insert/update a single column
• remove() : remove a column/super column/row
• batch_mutate() : update/remove multiple columns
• Reads:
• get() : retrieve a single column
• get_slice() : retrieve a group of columns (by names or range)
• get_range_slices() : retrieve a set of slices for a range of (row) keys
• count() : count the number of columns in a row
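The real API is exposed over Thrift; as an illustration of the calls' semantics only (not the actual client library), a toy in-memory column family might behave like this:

```python
# Toy in-memory model of the API semantics above (not the real Thrift client).
class ToyColumnFamily:
    def __init__(self):
        self.rows = {}  # row key -> {column name: (value, timestamp)}

    def insert(self, key, column, value, ts):
        # insert/update a single column (the greater timestamp wins)
        row = self.rows.setdefault(key, {})
        if column not in row or row[column][1] < ts:
            row[column] = (value, ts)

    def remove(self, key, column):
        # remove a single column
        self.rows.get(key, {}).pop(column, None)

    def get(self, key, column):
        # retrieve a single column's value
        return self.rows[key][column][0]

    def get_slice(self, key, start, finish):
        # retrieve the columns of one row whose names fall in [start, finish]
        row = self.rows.get(key, {})
        return {c: v for c, (v, _) in sorted(row.items()) if start <= c <= finish}

cf = ToyColumnFamily()
cf.insert("ident42", "fname", "John", ts=1)
cf.insert("ident42", "lname", "Doe", ts=1)
print(cf.get("ident42", "fname"))         # John
print(cf.get_slice("ident42", "a", "m"))  # {'fname': 'John', 'lname': 'Doe'}
```

Note that slices work by column name, which is why the comparator of the previous slide matters: it defines what a “range of columns” means.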
Cassandra Cluster: Replication & Consistency
Ring (Consistent Hashing)
• Data distribution:
• take a hash function (md5) and place each node on the domain of this hash
• each node is “responsible” for the keys that fall between its position and the preceding node’s
• to know where to store a column, use the node responsible for md5(row key)
• Data replication:
• the cluster has a replication factor (RF), e.g. RF = 3
• the RF − 1 additional replicas are placed on the preceding nodes
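A minimal sketch of the ring: nodes sit at md5 positions, a key maps to the first node at or after md5(key) (wrapping around), and further replicas are taken from adjacent ring positions (the walking direction is just a convention here):

```python
import hashlib
from bisect import bisect_left

def md5_token(s):
    # Position on the ring: the md5 hash interpreted as an integer.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((md5_token(n), n) for n in nodes)  # nodes in token order

def replicas(row_key, rf=3):
    token = md5_token(row_key)
    i = bisect_left(ring, (token, ""))  # first node position >= token
    # The responsible node plus rf-1 adjacent nodes, wrapping around.
    return [ring[(i + k) % len(ring)][1] for k in range(rf)]

print(replicas("ident42"))  # 3 distinct nodes responsible for this key
```

Because positions come from a hash, adding a node only moves the keys between it and its neighbour, which is what makes scaling incremental.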
Writing - Cluster side
• insert(42) reaches any node, which forwards the write to the replicas of the row key (all still holding the old value 24)
• Consistency Level: how many nodes must respond for success?
• CL.ZERO : none
• CL.ONE : one
• CL.QUORUM : one more than half the replicas
• Once enough replicas have acknowledged, the client gets “ok”; the remaining replicas are updated asynchronously
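The coordinator's success rule can be sketched as follows (replica availability is simulated with booleans; this is an illustration, not Cassandra's code):

```python
# Sketch of a coordinator write: send the insert to all replicas and
# report success once `required` acknowledgements have arrived.
def write_succeeds(replicas_up, consistency_level, rf=3):
    required = {"ZERO": 0, "ONE": 1, "QUORUM": rf // 2 + 1}[consistency_level]
    acks = sum(1 for up in replicas_up if up)  # each live replica acks
    return acks >= required

# RF = 3, so QUORUM needs 2 acknowledgements:
print(write_succeeds([True, True, False], "QUORUM"))   # True  (2 >= 2)
print(write_succeeds([True, False, False], "QUORUM"))  # False (1 < 2)
print(write_succeeds([True, False, False], "ONE"))     # True
```

This is the “always writable” trade-off in miniature: CL.ONE keeps accepting writes with two replicas down, at the price of weaker read guarantees.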
Reading - Cluster side
• get() reaches any node, which queries the replicas of the row key
• Consistency Level: how many nodes must respond for success?
• CL.ONE : one
• CL.QUORUM : one more than half the replicas
• If the returned values differ, the one with the greater timestamp is returned
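Resolving divergent replica answers is just a comparison on timestamps, as this sketch shows (answers are (value, timestamp) pairs):

```python
# Sketch of read resolution: among the replica answers, return the value
# carrying the greatest timestamp.
def resolve(answers):
    return max(answers, key=lambda vt: vt[1])[0]

# Two replicas already hold the new value, one still has the old one:
print(resolve([("42", 2), ("42", 2), ("24", 1)]))  # 42
```

With QUORUM reads and QUORUM writes (2 + 2 > 3 replicas), at least one answering replica always holds the latest write, so this resolution returns it.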
Failure and Consistency
To repair inconsistencies when they occur:
1. Hinted Handoff: when a node is down, its insertions are sent to another machine, which hands them back once the node comes back alive.
2. Read Repair: on reads, if values differ, the out-of-sync nodes are repaired by inserting the newer value.
3. Anti-Entropy: compare the versions held by two nodes using Merkle trees (manual operation).
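The point of the Merkle tree in anti-entropy is to locate divergent ranges without shipping all the data: if the roots match, the nodes agree; if not, comparing lower levels narrows down where they differ. A toy version (power-of-two leaf counts only, for brevity):

```python
import hashlib

def h(data):
    return hashlib.md5(data).hexdigest()

def merkle(leaves):
    # Build the tree bottom-up: each level hashes pairs of the level below.
    level = [h(x.encode()) for x in leaves]
    tree = [level]
    while len(level) > 1:
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree  # tree[-1][0] is the root hash

a = merkle(["row1=v1", "row2=v2", "row3=v3", "row4=v4"])
b = merkle(["row1=v1", "row2=XX", "row3=v3", "row4=v4"])
# Roots differ, so the nodes are out of sync ...
print(a[-1][0] == b[-1][0])  # False
# ... and comparing leaves pinpoints which range diverged:
print([i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y])  # [1]
```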
A Cassandra node
Write Path
1. write to the commit log (for durability)
2. write to the memtable (the write is then acknowledged to the client)
3. if the memtable reaches a threshold, flush it to disk as an SSTable
4. Remark: a deletion amounts to the insertion of a “tombstone”
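The four steps above can be sketched as follows (the threshold is tiny for illustration; real SSTables are on-disk, immutable, sorted files):

```python
# Sketch of the write path: append to the commit log first (durability),
# then update the in-memory memtable; flush it as a sorted "SSTable"
# once it grows past a threshold. A delete writes a tombstone.
TOMBSTONE = object()

commit_log, memtable, sstables = [], {}, []
THRESHOLD = 3

def write(column, value, ts):
    commit_log.append((column, value, ts))  # 1. commit log
    memtable[column] = (value, ts)          # 2. memtable -> ack the client
    if len(memtable) >= THRESHOLD:          # 3. flush as an SSTable
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def delete(column, ts):
    write(column, TOMBSTONE, ts)            # 4. deletion = tombstone

write("fname", "John", 1)
write("lname", "Doe", 2)
delete("phone", 3)
print(len(sstables), len(memtable))  # 1 0  (the memtable flushed)
```

Both the commit-log append and the memtable update are sequential operations, which is why writes are so cheap: no disk seek happens on the write path.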
Read Path
• Versions of the same column can live at the same time:
• in the memtable
• in memtables being flushed
• in one or several SSTables
• We need to read all versions and resolve using timestamps
• But:
• bloom filters let us skip files that cannot contain the row
• SSTables are indexed
• compaction keeps the number of files reasonable
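A bloom filter answers “might this file contain the row?” with no false negatives and occasional false positives, so a read never misses data and usually skips most files. A toy version (sizes and hash scheme are illustrative):

```python
import hashlib

# Sketch of a bloom filter: k hash functions each set one bit per key.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive `hashes` bit positions from md5 of a salted key.
        for i in range(self.hashes):
            d = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(d, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # All bits set -> maybe present; any bit clear -> definitely absent.
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("ident42")
print(bf.might_contain("ident42"))  # True (always, once added)
print(bf.might_contain("ident99"))  # almost certainly False
```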
Compaction
• Runs regularly as a background operation
• Merges SSTables together
• Gets rid of obsolete and deleted values
But...
• Temporarily requires extra disk space
• As of today, needs to deserialize each row entirely
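The merge itself can be sketched in a few lines: keep only the newest version of each column across the input SSTables, then drop what the tombstones deleted (here a plain marker stands in for a real tombstone):

```python
# Sketch of compaction: merge SSTables (column -> (value, ts) maps),
# keeping the newest version of each column and dropping deleted ones.
TOMBSTONE = None

def compact(sstables):
    merged = {}
    for table in sstables:
        for column, (value, ts) in table.items():
            if column not in merged or merged[column][1] < ts:
                merged[column] = (value, ts)
    # Obsolete versions are gone; now drop the deleted columns too.
    return {c: vt for c, vt in merged.items() if vt[0] is not TOMBSTONE}

old = {"fname": ("John", 1), "phone": ("0612346789", 1)}
new = {"fname": ("Johnny", 5), "phone": (TOMBSTONE, 6)}
print(compact([old, new]))  # {'fname': ('Johnny', 5)}
```

The temporary disk-space cost mentioned above comes from writing the merged table before the inputs can be removed.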