cassandra

38
Cassandra Scalable Distributed Data Store Sylvain Lebresne ([email protected] )

Upload: pcmanus

Post on 19-Aug-2015

2.584 views

Category:

Technology


3 download

TRANSCRIPT

Page 1: Cassandra

Cassandra

Scalable Distributed Data Store Sylvain Lebresne([email protected])

Page 2: Cassandra

A few dates

• Created by Facebook (around 2007)

• Open-sourced in 2008

• Becomes a ASF incubator project in January 2009

• Graduated as a top-level ASF project in February 2010

• 3 major releases (current is 0.6)

• In production in multiple companies (Facebook, Digg, Twitter, Reddit, Rackspace, etc.) with largest cluster of over 150 machines

Page 3: Cassandra

Cassandra

Big TableDynamo

distribution modeldata model and

storage architecture

Page 4: Cassandra

Why Cassandra ?

• Fully distributed (client can connect to any node)

• No single point of failure

• Incremental scalability

• Richer data model than simple key-value

• Data center aware

• Fast reads, faster writes (“optimize for reads, writes are cheap”)

• Always writable

• Eventually consistent

Page 5: Cassandra

Eventual Consistency

• Isn’t consistency important ?

• Yes, eventual consistency:

• Not : “Let’s not be consistent”

• But more: “Instead of designing (costly) measure to prevent inconsistency, we acknowledge that the cluster may be in an inconsistent state for a brief period of time, and we deal with it”

• Moreover, Cassandra allows the client to choose a trade-off between consistency and latency

Page 6: Cassandra

Data Model

Page 7: Cassandra

Data model

• A distributed multi-level hash map:

Keyspace Column Family row key

column key

super column key column key

value

value

(one by application)

(a few) (as many as you want)

(millions is ok but limited by node

capacity)

(millions is ok but limited by node

capacity)

(in current implementation, not too many)

Page 8: Cassandra

name value ts

Column

Columns

Page 9: Cassandra

Column

“fname” “John”

Columns

Page 10: Cassandra

“fname” “John”

“lname” “Doe”

“phone” 0612346789

“age” 35

“picture” 0x0FC3...

Row

ident42

Rows

Page 11: Cassandra

“fname” “John”

“lname” “Doe”

“phone” 0612346789

“age” 35

“picture” 0x0FC3...

ident42

“fname” “Chuck”

“lname” “Norris”

“age” 70

“email” “[email protected]

“picture” 0x159A...

ident123

“fname” “Sylvain”

“lname” “Lebresne”

“phone” 0698765432

“age” 29

ident24

Column Family

Column Family

Page 12: Cassandra

1983483 msg11

1983490 msg12

1983512 msg13

1983618 msg14

ident24

1990310 msg21

1991321 msg22

2015672 msg23

ident123

ident42

1847820 msg31

1848923 msg32

1848924 msg33

1983618 msg34

ident42

ident123

(Super) Column Family

SuperColumns

Page 13: Cassandra

Comparison

• Columns and super columns are sorted.

• This sorting is customizable and defined by column family.

• Predefined sorts are:

• BytesType

• LongType

• AsciiType

• UTF8Type

• LexicalUUIDType

• TimeUUIDType

Page 14: Cassandra

API

• Writes:

• insert() : insert/update a single column

• remove() : remove a column/super column/row

• batch_mutate() : update/remove multiple columns

• Reads:

• get() : retrieve a single column

• get_slice() : retrieve a group of columns (by names or range)

• get_range_slices() : retrieve a set of slices for a range of (row) keys

• count() : count the number of columns in a row

Page 15: Cassandra

Cassandra Cluster: Replication & Consistency

Page 16: Cassandra

Ring (Consistent Hashing)

RF = 3• Data distribution:

• take a hash function (md5) and place node on the domain of this hash

• each node is “responsible” of the key that falls between its position and the preceding node

• to know where to store a column, use node responsible of md5(row key)

• Data replication:

• cluster have a replication factor (RF).

• place replicas on preceding nodes

Page 17: Cassandra

Writing - Cluster side

24

24

24

Page 18: Cassandra

Writing - Cluster side

24

24

24

insert( )42

Page 19: Cassandra

Writing - Cluster side

24

24

24

insert( )42

Page 20: Cassandra

Writing - Cluster side

24

24

24

insert( )42

• Consistency Level : how many node must respond for success ?

Page 21: Cassandra

Writing - Cluster side

24

24

24

insert( )42

• Consistency Level : how many node must respond for success ?• CL.ZERO : none

Page 22: Cassandra

Writing - Cluster side

24

24

24

insert( )42

• Consistency Level : how many node must respond for success ?• CL.ZERO : none• CL.ONE : one

Page 23: Cassandra

Writing - Cluster side

24

24

24

insert( )42

• Consistency Level : how many node must respond for success ?• CL.ZERO : none• CL.ONE : one• CL.QUORUM : one more

than half the replicas

Page 24: Cassandra

Writing - Cluster side

24

• Consistency Level : how many node must respond for success ?• CL.ZERO : none• CL.ONE : one• CL.QUORUM : one more

than half the replicas42

42

ok

Page 25: Cassandra

Writing - Cluster side

24

• Consistency Level : how many node must respond for success ?• CL.ZERO : none• CL.ONE : one• CL.QUORUM : one more

than half the replicas42

42

Page 26: Cassandra

Writing - Cluster side

• Consistency Level : how many node must respond for success ?• CL.ZERO : none• CL.ONE : one• CL.QUORUM : one more

than half the replicas42

42

42

Page 27: Cassandra

Reading - Cluster side

42

42

get( )

42

Page 28: Cassandra

Reading - Cluster side

42

42

get( )

42

Page 29: Cassandra

Reading - Cluster side

42

42

get( )

• Consistency Level : how many node must respond for success ?

42

Page 30: Cassandra

Reading - Cluster side

42

42

get( )

• Consistency Level : how many node must respond for success ?• CL.ONE : one

42

Page 31: Cassandra

Reading - Cluster side

42

42

get( )

• Consistency Level : how many node must respond for success ?• CL.ONE : one• CL.QUORUM : one more

than half the replicas

42

Page 32: Cassandra

Reading - Cluster side

42

42

get( )

• Consistency Level : how many node must respond for success ?• CL.ONE : one• CL.QUORUM : one more

than half the replicas

42

42

Page 33: Cassandra

Reading - Cluster side

42

42

get( )

• Consistency Level : how many node must respond for success ?• CL.ONE : one• CL.QUORUM : one more

than half the replicas• If values differs, returns the one

with greater timestamp

42

42

Page 34: Cassandra

Failure and Consistency

To repair inconsistency when they occurs:

1. Hinted Handoff: when a node is down, insertions are send to another machine. Those insertions are sent to the node come back alive.

2. Read Repair: on reads, if values differ, the out of sync nodes are repaired by inserting the newer value.

3. Anti Entropy: compare versions in two nodes using merkle tree (manual operation).

Page 35: Cassandra

A Cassandra node

Page 36: Cassandra

Write Path

1. write commit log (for persistency)

2. write memtable (write is acknowledged to client)

3. if memtable reach treshold, flush to disk as SSTable.

4. Remark: deletion amounts to the insertion of a “tombstone”.

Page 37: Cassandra

Read Path

• Versions of the same column can be at the same time:

• in the memtable

• in the memtables being flushed

• in one or multiple SSTable

• We need to read all version and resolve using timestamp

• But:

• bloom filters allow to skip reading unnecessary files

• SSTable are indexed

• Compaction keep things reasonnable

Page 38: Cassandra

Compaction

• Runs regularly as a background operation

• Merge SSTables together

• Get rid of old and deleted values

But...

• Requires disk space temporarily

• As of today, needs to deserialize each row entirely