Cassandra
Scalable Distributed Data Store
Sylvain Lebresne ([email protected])
A few dates
• Created by Facebook (around 2007)
• Open-sourced in 2008
• Became an ASF incubator project in January 2009
• Graduated as a top-level ASF project in February 2010
• 3 major releases (current is 0.6)
• In production at multiple companies (Facebook, Digg, Twitter, Reddit, Rackspace, etc.), with the largest cluster spanning over 150 machines
Cassandra = Big Table + Dynamo
• From Dynamo: the distribution model
• From Big Table: the data model and storage architecture
Why Cassandra?
• Fully distributed (client can connect to any node)
• No single point of failure
• Incremental scalability
• Richer data model than simple key-value
• Data center aware
• Fast reads, faster writes (“optimize for reads, writes are cheap”)
• Always writable
• Eventually consistent
Eventual Consistency
• Isn’t consistency important?
• Yes, hence eventual consistency:
• Not: “Let’s not be consistent”
• But rather: “Instead of designing (costly) measures to prevent inconsistency, we acknowledge that the cluster may be in an inconsistent state for a brief period of time, and we deal with it”
• Moreover, Cassandra lets the client choose a trade-off between consistency and latency
Data Model
• A distributed multi-level hash map:
Keyspace → Column Family → row key → [super column key →] column key → value
• Keyspace: one per application
• Column Family: a few
• row key: millions are fine, but limited by node capacity
• column key: millions are fine, but limited by node capacity
• super column key: in the current implementation, not too many
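The multi-level map can be sketched as nested Python dicts (all names below are illustrative, not a real schema):

```python
# A toy model of Cassandra's multi-level map: keyspace -> column family
# -> row key -> (super column key ->) column key -> value.
keyspace = {
    "UserProfiles": {                  # column family (a few per keyspace)
        "ident42": {                   # row key (millions are fine)
            "fname": "John",           # column key -> value
            "lname": "Doe",
        },
    },
    "Inbox": {                         # a *super* column family
        "ident42": {
            "2010-04": {               # super column key (not too many)
                "1983483": "msg11",    # column key -> value
            },
        },
    },
}

# A look-up simply walks the levels of the map:
print(keyspace["UserProfiles"]["ident42"]["fname"])  # John
```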
Column
• The basic unit: a (name, value, timestamp) triple.
Columns
“fname” → “John”
“lname” → “Doe”
“phone” → 0612346789
“age” → 35
“picture” → 0x0FC3...
Row
• A row key and its columns, e.g. the columns above under the key ident42.
Rows
ident42 → { “fname”: “John”, “lname”: “Doe”, “phone”: 0612346789, “age”: 35, “picture”: 0x0FC3... }
ident123 → { “fname”: “Chuck”, “lname”: “Norris”, “age”: 70, “email”: “[email protected]”, “picture”: 0x159A... }
ident24 → { “fname”: “Sylvain”, “lname”: “Lebresne”, “phone”: 0698765432, “age”: 29 }
Column Family
• A set of rows; here, column names are timestamps and values are messages:
ident24 → { 1983483: msg11, 1983490: msg12, 1983512: msg13, 1983618: msg14 }
ident123 → { 1990310: msg21, 1991321: msg22, 2015672: msg23 }
ident42 → { 1847820: msg31, 1848923: msg32, 1848924: msg33, 1983618: msg34 }
(Super) Column Family
• The same, with super columns adding one extra level of nesting between the row key and the columns.
Comparison
• Columns and super columns are sorted by name.
• The sort order is customizable and defined per column family.
• Predefined comparators are:
• BytesType
• LongType
• AsciiType
• UTF8Type
• LexicalUUIDType
• TimeUUIDType
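Why the comparator matters: the same column names sort differently under different types. A small sketch contrasting byte order with numeric order (the decoding below is illustrative; LongType actually compares 8-byte big-endian values):

```python
# Columns in a column family are kept sorted by the CF's comparator.
columns = {b"10": "a", b"2": "b", b"1": "c"}

# BytesType-style: lexicographic byte order
bytes_order = sorted(columns)
# LongType-style: numeric order on the decoded value
long_order = sorted(columns, key=lambda name: int(name))

print(bytes_order)  # [b'1', b'10', b'2']
print(long_order)   # [b'1', b'2', b'10']
```

With BytesType, b"10" sorts before b"2"; a numeric comparator restores the order a range query over timestamps or counters would expect.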
API
• Writes:
• insert() : insert/update a single column
• remove() : remove a column/super column/row
• batch_mutate() : update/remove multiple columns
• Reads:
• get() : retrieve a single column
• get_slice() : retrieve a group of columns (by names or range)
• get_range_slices() : retrieve a set of slices for a range of (row) keys
• count() : count the number of columns in a row
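The real API is exposed over Thrift; as an illustration of the calls' semantics only (not the actual client library), a toy in-memory column family might behave like this:

```python
# Toy in-memory model of the API semantics above (not the real Thrift client).
class ToyColumnFamily:
    def __init__(self):
        self.rows = {}  # row key -> {column name: (value, timestamp)}

    def insert(self, key, column, value, ts):
        # insert/update a single column (the greater timestamp wins)
        row = self.rows.setdefault(key, {})
        if column not in row or row[column][1] < ts:
            row[column] = (value, ts)

    def remove(self, key, column):
        # remove a single column
        self.rows.get(key, {}).pop(column, None)

    def get(self, key, column):
        # retrieve a single column's value
        return self.rows[key][column][0]

    def get_slice(self, key, start, finish):
        # retrieve the columns of one row whose names fall in [start, finish]
        row = self.rows.get(key, {})
        return {c: v for c, (v, _) in sorted(row.items()) if start <= c <= finish}

cf = ToyColumnFamily()
cf.insert("ident42", "fname", "John", ts=1)
cf.insert("ident42", "lname", "Doe", ts=1)
print(cf.get("ident42", "fname"))         # John
print(cf.get_slice("ident42", "a", "m"))  # {'fname': 'John', 'lname': 'Doe'}
```

Note that slices work by column name, which is why the comparator of the previous slide matters: it defines what a “range of columns” means.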
Cassandra Cluster: Replication & Consistency
Ring (Consistent Hashing)
• Data distribution:
• take a hash function (md5) and place each node on the domain of this hash
• each node is “responsible” for the keys that fall between its position and the preceding node’s
• to know where to store a column, use the node responsible for md5(row key)
• Data replication:
• the cluster has a replication factor (RF), e.g. RF = 3
• the RF − 1 additional replicas are placed on the preceding nodes
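A minimal sketch of the ring: nodes sit at md5 positions, a key maps to the first node at or after md5(key) (wrapping around), and further replicas are taken from adjacent ring positions (the walking direction is just a convention here):

```python
import hashlib
from bisect import bisect_left

def md5_token(s):
    # Position on the ring: the md5 hash interpreted as an integer.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((md5_token(n), n) for n in nodes)  # nodes in token order

def replicas(row_key, rf=3):
    token = md5_token(row_key)
    i = bisect_left(ring, (token, ""))  # first node position >= token
    # The responsible node plus rf-1 adjacent nodes, wrapping around.
    return [ring[(i + k) % len(ring)][1] for k in range(rf)]

print(replicas("ident42"))  # 3 distinct nodes responsible for this key
```

Because positions come from a hash, adding a node only moves the keys between it and its neighbour, which is what makes scaling incremental.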
Writing - Cluster side
• insert(42) reaches any node, which forwards the write to the replicas of the row key (all still holding the old value 24)
• Consistency Level: how many nodes must respond for success?
• CL.ZERO : none
• CL.ONE : one
• CL.QUORUM : one more than half the replicas
• Once enough replicas have acknowledged, the client gets “ok”; the remaining replicas are updated asynchronously
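The coordinator's success rule can be sketched as follows (replica availability is simulated with booleans; this is an illustration, not Cassandra's code):

```python
# Sketch of a coordinator write: send the insert to all replicas and
# report success once `required` acknowledgements have arrived.
def write_succeeds(replicas_up, consistency_level, rf=3):
    required = {"ZERO": 0, "ONE": 1, "QUORUM": rf // 2 + 1}[consistency_level]
    acks = sum(1 for up in replicas_up if up)  # each live replica acks
    return acks >= required

# RF = 3, so QUORUM needs 2 acknowledgements:
print(write_succeeds([True, True, False], "QUORUM"))   # True  (2 >= 2)
print(write_succeeds([True, False, False], "QUORUM"))  # False (1 < 2)
print(write_succeeds([True, False, False], "ONE"))     # True
```

This is the “always writable” trade-off in miniature: CL.ONE keeps accepting writes with two replicas down, at the price of weaker read guarantees.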
Reading - Cluster side
• get() reaches any node, which queries the replicas of the row key
• Consistency Level: how many nodes must respond for success?
• CL.ONE : one
• CL.QUORUM : one more than half the replicas
• If the returned values differ, the one with the greater timestamp is returned
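Resolving divergent replica answers is just a comparison on timestamps, as this sketch shows (answers are (value, timestamp) pairs):

```python
# Sketch of read resolution: among the replica answers, return the value
# carrying the greatest timestamp.
def resolve(answers):
    return max(answers, key=lambda vt: vt[1])[0]

# Two replicas already hold the new value, one still has the old one:
print(resolve([("42", 2), ("42", 2), ("24", 1)]))  # 42
```

With QUORUM reads and QUORUM writes (2 + 2 > 3 replicas), at least one answering replica always holds the latest write, so this resolution returns it.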
Failure and Consistency
To repair inconsistencies when they occur:
1. Hinted Handoff: when a node is down, its insertions are sent to another machine, which hands them back once the node comes back alive.
2. Read Repair: on reads, if values differ, the out-of-sync nodes are repaired by inserting the newer value.
3. Anti-Entropy: compare the versions held by two nodes using Merkle trees (manual operation).
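The point of the Merkle tree in anti-entropy is to locate divergent ranges without shipping all the data: if the roots match, the nodes agree; if not, comparing lower levels narrows down where they differ. A toy version (power-of-two leaf counts only, for brevity):

```python
import hashlib

def h(data):
    return hashlib.md5(data).hexdigest()

def merkle(leaves):
    # Build the tree bottom-up: each level hashes pairs of the level below.
    level = [h(x.encode()) for x in leaves]
    tree = [level]
    while len(level) > 1:
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree  # tree[-1][0] is the root hash

a = merkle(["row1=v1", "row2=v2", "row3=v3", "row4=v4"])
b = merkle(["row1=v1", "row2=XX", "row3=v3", "row4=v4"])
# Roots differ, so the nodes are out of sync ...
print(a[-1][0] == b[-1][0])  # False
# ... and comparing leaves pinpoints which range diverged:
print([i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y])  # [1]
```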
A Cassandra node
Write Path
1. write to the commit log (for durability)
2. write to the memtable (the write is then acknowledged to the client)
3. if the memtable reaches a threshold, flush it to disk as an SSTable
4. Remark: a deletion amounts to the insertion of a “tombstone”
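The four steps above can be sketched as follows (the threshold is tiny for illustration; real SSTables are on-disk, immutable, sorted files):

```python
# Sketch of the write path: append to the commit log first (durability),
# then update the in-memory memtable; flush it as a sorted "SSTable"
# once it grows past a threshold. A delete writes a tombstone.
TOMBSTONE = object()

commit_log, memtable, sstables = [], {}, []
THRESHOLD = 3

def write(column, value, ts):
    commit_log.append((column, value, ts))  # 1. commit log
    memtable[column] = (value, ts)          # 2. memtable -> ack the client
    if len(memtable) >= THRESHOLD:          # 3. flush as an SSTable
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def delete(column, ts):
    write(column, TOMBSTONE, ts)            # 4. deletion = tombstone

write("fname", "John", 1)
write("lname", "Doe", 2)
delete("phone", 3)
print(len(sstables), len(memtable))  # 1 0  (the memtable flushed)
```

Both the commit-log append and the memtable update are sequential operations, which is why writes are so cheap: no disk seek happens on the write path.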
Read Path
• Versions of the same column can live at the same time:
• in the memtable
• in memtables being flushed
• in one or several SSTables
• We need to read all versions and resolve using timestamps
• But:
• bloom filters let us skip files that cannot contain the row
• SSTables are indexed
• compaction keeps the number of files reasonable
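A bloom filter answers “might this file contain the row?” with no false negatives and occasional false positives, so a read never misses data and usually skips most files. A toy version (sizes and hash scheme are illustrative):

```python
import hashlib

# Sketch of a bloom filter: k hash functions each set one bit per key.
class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive `hashes` bit positions from md5 of a salted key.
        for i in range(self.hashes):
            d = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(d, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # All bits set -> maybe present; any bit clear -> definitely absent.
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("ident42")
print(bf.might_contain("ident42"))  # True (always, once added)
print(bf.might_contain("ident99"))  # almost certainly False
```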
Compaction
• Runs regularly as a background operation
• Merges SSTables together
• Gets rid of obsolete and deleted values
But...
• Temporarily requires extra disk space
• As of today, needs to deserialize each row entirely
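The merge itself can be sketched in a few lines: keep only the newest version of each column across the input SSTables, then drop what the tombstones deleted (here a plain marker stands in for a real tombstone):

```python
# Sketch of compaction: merge SSTables (column -> (value, ts) maps),
# keeping the newest version of each column and dropping deleted ones.
TOMBSTONE = None

def compact(sstables):
    merged = {}
    for table in sstables:
        for column, (value, ts) in table.items():
            if column not in merged or merged[column][1] < ts:
                merged[column] = (value, ts)
    # Obsolete versions are gone; now drop the deleted columns too.
    return {c: vt for c, vt in merged.items() if vt[0] is not TOMBSTONE}

old = {"fname": ("John", 1), "phone": ("0612346789", 1)}
new = {"fname": ("Johnny", 5), "phone": (TOMBSTONE, 6)}
print(compact([old, new]))  # {'fname': ('Johnny', 5)}
```

The temporary disk-space cost mentioned above comes from writing the merged table before the inputs can be removed.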