cassandra - o'reilly mediaassets.en.oreilly.com/1/event/27/cassandra_ open source...
![Page 1: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/1.jpg)
Cassandra
Jonathan Ellis
![Page 2: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/2.jpg)
Motivation
● Scaling reads to a relational database is hard
● Scaling writes to a relational database is virtually impossible
● … and when you do, it usually isn't relational anymore
![Page 3: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/3.jpg)
The new face of data
● Scale out, not up
● Online load balancing, cluster growth
● Flexible schema
● Key-oriented queries
● CAP-aware
![Page 4: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/4.jpg)
CAP theorem
● Pick two of Consistency, Availability, Partition tolerance
![Page 5: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/5.jpg)
Two famous papers
● Bigtable: A distributed storage system for structured data, 2006
● Dynamo: Amazon's highly available key-value store, 2007
![Page 6: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/6.jpg)
Two approaches
● Bigtable: “How can we build a distributed db on top of GFS?”
● Dynamo: “How can we build a distributed hash table appropriate for the data center?”
![Page 7: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/7.jpg)
10,000 ft summary
● Dynamo partitioning and replication
● Log-structured ColumnFamily data model similar to Bigtable's
![Page 8: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/8.jpg)
Cassandra highlights
● High availability
● Incremental scalability
● Eventually consistent
● Tunable tradeoffs between consistency and latency
● Minimal administration
● No single point of failure (SPOF)
![Page 9: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/9.jpg)
![Page 10: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/10.jpg)
![Page 11: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/11.jpg)
![Page 12: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/12.jpg)
![Page 13: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/13.jpg)
Dynamo architecture & Lookup
![Page 14: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/14.jpg)
Architecture details
● O(1) node lookup
● Explicit replication
● Eventually consistent
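The "O(1) node lookup" bullet can be read as: every node knows the whole ring, so a key is mapped to its replicas locally rather than through multi-hop routing. The following is a minimal sketch of that idea, not Cassandra's implementation. The MD5-based `token` function, the `Ring` class, and the node names are assumptions made for illustration.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key onto the ring (MD5 is an assumption for this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, replication_factor=3):
        # Each node owns one token. Every node holds this full table locally,
        # which is why no multi-hop routing is needed to find a key's owner.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]
        self.n = replication_factor

    def replicas(self, key: str):
        """Return the N distinct nodes responsible for key, walking clockwise."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        count = min(self.n, len(self.entries))
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("keyA")  # three distinct nodes, the same on every call
```

The local search over the sorted token list is logarithmic in the number of nodes; the constant in "O(1)" refers to network hops, under the assumption stated above.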
![Page 15: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/15.jpg)
Architecture layers
Messaging service
Gossip
Failure detection
Cluster state
Partitioner
Replication
Commit log
Memtable
SSTable
Indexes
Compaction
Tombstones
Hinted handoff
Read repair
Bootstrap
Monitoring
Admin tools
![Page 16: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/16.jpg)
Writes
● Any node
● Partitioner
● Commitlog, memtable
● SSTable
● Compaction
● Wait for W responses
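The write path listed above can be sketched as a toy model: append to the commit log, update the memtable, flush to an immutable SSTable when the memtable fills, and report success once W replicas have acknowledged. The `Node` class, `coordinated_write`, and the tiny `memtable_limit` are hypothetical names chosen for this sketch. The replicas are called synchronously here, whereas a real coordinator would send the writes in parallel.

```python
class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # in memory: key -> {column: (value, timestamp)}
        self.sstables = []     # immutable, key-sorted snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Sequential append for durability; no read or seek is needed.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Apply to the memtable, keeping the newest timestamp per column.
        row = self.memtable.setdefault(key, {})
        if column not in row or row[column][1] <= timestamp:
            row[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return True  # acknowledgement to the coordinator

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(sorted(self.memtable.items(), key=lambda kv: kv[0]))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log


def coordinated_write(replicas, w, key, column, value, timestamp):
    """Send the write to every replica; succeed once W have acknowledged."""
    acks = sum(1 for node in replicas if node.write(key, column, value, timestamp))
    return acks >= w
```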
![Page 17: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/17.jpg)
Memtable / SSTable
Commit log
Disk
![Page 18: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/18.jpg)
SSTable format
● Key / data
![Page 19: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/19.jpg)
SSTable Indexes
● Bloom filter
● Key
● Column
(Similar to Hadoop MapFile / TFile)
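A Bloom filter lets a read skip an SSTable that cannot contain the requested key. The sketch below shows the general technique only. The class name, the SHA-1 salting scheme, and the sizes are assumptions for illustration, not the format Cassandra stores on disk.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive k positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

A `False` answer is always correct, so the SSTable can be skipped without touching disk. A `True` answer may be a false positive and still requires consulting the key index.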
![Page 20: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/20.jpg)
Compaction
● Merge keys
● Combine columns
● Discard tombstones
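The three compaction steps can be sketched over in-memory dictionaries. This is a simplification under stated assumptions: each SSTable is modeled as `{key: {column: (value, timestamp)}}`, a value of `None` marks a tombstone, and every tombstone is dropped immediately, which ignores the tombstone GC delay.

```python
TOMBSTONE = None  # sentinel value marking a deleted column (assumption)


def compact(sstables):
    """Merge several SSTables into one, keeping the newest version of each column."""
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})              # merge keys
            for name, (value, ts) in columns.items():
                if name not in row or row[name][1] < ts:
                    row[name] = (value, ts)               # combine columns: newest wins
    result = {}
    for key in sorted(merged):
        live = {n: c for n, c in merged[key].items() if c[0] is not TOMBSTONE}
        if live:                                          # discard tombstones
            result[key] = live
    return result
```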
![Page 21: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/21.jpg)
Remove
● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
● Read repair complicates things a little
● Eventual consistency complicates things more
● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
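Why a tombstone is needed, and when it may be dropped, can be illustrated with a short sketch. The model is an assumption for illustration: SSTables are dictionaries of `{key: {column: (value, timestamp)}}`, `None` marks a tombstone, and `gc_grace` is a hypothetical name for the configurable delay.

```python
TOMBSTONE = None  # sentinel for a deleted column (assumption)


def read_column(sstables, key, name):
    """Return the live value of a column, or None if it is absent or deleted."""
    newest = None
    for table in sstables:
        cell = table.get(key, {}).get(name)
        if cell is not None and (newest is None or cell[1] > newest[1]):
            newest = cell
    if newest is None or newest[0] is TOMBSTONE:
        return None  # the tombstone hides older values still on disk
    return newest[0]


def purgeable(tombstone_ts, now, gc_grace):
    """A tombstone may be dropped at compaction only after the grace delay."""
    return now - tombstone_ts >= gc_grace
```

Without the tombstone in the newer SSTable, the read would fall through to the older SSTable and return the deleted value.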
![Page 22: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/22.jpg)
Cassandra write properties
● No reads
● No seeks
● Fast
● Atomic within ColumnFamily
● Always writable
![Page 23: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/23.jpg)
Read path
● Any node
● Partitioner
● Wait for R responses
● Wait for N – R responses in the background and perform read repair
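The read path above can be sketched as follows, assuming each replica is a plain dictionary mapping a key to `(value, timestamp)`. The function names are hypothetical, and the repair runs inline here although the slide describes it as background work.

```python
def read_repair(replicas, key):
    """Push the newest version to any replica holding a stale or missing copy."""
    cells = [rep[key] for rep in replicas if key in rep]
    if not cells:
        return
    newest = max(cells, key=lambda c: c[1])
    for rep in replicas:
        if rep.get(key, (None, -1))[1] < newest[1]:
            rep[key] = newest


def coordinated_read(replicas, r, key):
    """Answer from the first R replicas, then repair all N."""
    responses = [rep.get(key) for rep in replicas[:r]]
    answered = [c for c in responses if c is not None]
    result = max(answered, key=lambda c: c[1]) if answered else None
    read_repair(replicas, key)  # background in a real system
    return result
```

With R = 1 the caller may see a stale value, but the repair still brings every replica up to date for later reads.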
![Page 24: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/24.jpg)
Cassandra read properties
● Read multiple SSTables
● Slower than writes (but still fast)
● Seeks can be mitigated with more RAM
● Scales to billions of rows
![Page 25: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/25.jpg)
Consistency in a BASE world
● If W + R > N, you will have consistency
● W=1, R=N
● W=N, R=1
● W=Q, R=Q where Q = N / 2 + 1
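The rule W + R > N guarantees that every read set overlaps every write set in at least one replica. A short sketch of the arithmetic, with hypothetical function names and integer division assumed for Q:

```python
def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def is_consistent(w: int, r: int, n: int) -> bool:
    """True when any R replicas must include at least one of the W written."""
    return w + r > n


n = 3
q = quorum(n)  # 2 for N = 3
settings = {
    "W=1, R=N": is_consistent(1, n, n),
    "W=N, R=1": is_consistent(n, 1, n),
    "W=Q, R=Q": is_consistent(q, q, n),
    "W=1, R=1": is_consistent(1, 1, n),  # fast, but reads may be stale
}
```

For N = 3 the first three settings satisfy the rule and the last does not.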
![Page 26: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/26.jpg)
vs MySQL with 50GB of data
● MySQL
  ● ~300ms write
  ● ~350ms read
● Cassandra
  ● ~0.12ms write
  ● ~15ms read
● Achtung!
![Page 27: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/27.jpg)
Data model
● Rows, ColumnFamilies, Columns
![Page 28: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/28.jpg)
ColumnFamilies
keyA column1 column2 column3
keyC column1 column7 column11
Column
Byte[] Name
Byte[] Value
I64 timestamp
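The ColumnFamily diagram above can be modeled as a map of row keys to named columns, where each column carries a name, a value, and a timestamp. The sketch below mirrors the slide's rows. The empty values and the timestamp of 1 are placeholders, and the nested-dictionary layout is an assumption for illustration rather than the storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # i64 on the slide


# One ColumnFamily: row key -> {column name -> Column}.
# Rows in the same ColumnFamily need not share the same columns.
column_family = {
    "keyA": {n: Column(n, b"", 1) for n in (b"column1", b"column2", b"column3")},
    "keyC": {n: Column(n, b"", 1) for n in (b"column1", b"column7", b"column11")},
}
```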
![Page 29: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/29.jpg)
Super ColumnFamilies
keyF Super1 Super2
keyJ Super1 Super5
column column column column column column
column column column column column column
![Page 30: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/30.jpg)
Types of queries
● Single column
● Slice
  ● Set of names / range of names
  ● Simple slice -> columns
  ● Super slice -> supercolumns
● Key range
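The two kinds of slice, by a set of names and by a range of names, can be sketched over a single row. Here a row is assumed to be a dictionary of column name to value, and the function names and the `count` parameter are hypothetical.

```python
import bisect


def slice_by_names(row, names):
    """Return the requested columns that exist, in sorted name order."""
    return [(n, row[n]) for n in sorted(names) if n in row]


def slice_by_range(row, start, finish, count=100):
    """Return up to `count` columns with start <= name <= finish."""
    names = sorted(row)
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(n, row[n]) for n in names[lo:hi][:count]]
```

A super slice would apply the same logic one level up, returning supercolumns instead of columns.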
![Page 31: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/31.jpg)
Range queries
● Add “master” server
● Implement on top of K/V
● Order-preserving partitioning
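Of the three options above, order-preserving partitioning keeps adjacent keys on the same or neighboring nodes, so a key range touches only the nodes whose ranges overlap it. The sketch below illustrates that property. The class, the boundary tokens, and the sample keys are all hypothetical.

```python
import bisect


class OrderPreservingRing:
    def __init__(self, boundaries):
        # boundaries: sorted upper bounds; node i owns keys <= boundaries[i],
        # and the last node also owns everything beyond its boundary.
        self.boundaries = boundaries
        self.nodes = [[] for _ in boundaries]  # each node keeps its keys sorted

    def node_for(self, key):
        return min(bisect.bisect_left(self.boundaries, key), len(self.nodes) - 1)

    def insert(self, key):
        bisect.insort(self.nodes[self.node_for(key)], key)

    def get_key_range(self, start_with, stop_at, max_results=100):
        """Scan only the nodes whose ranges overlap [start_with, stop_at]."""
        results = []
        for i in range(self.node_for(start_with), self.node_for(stop_at) + 1):
            for key in self.nodes[i]:
                if start_with <= key <= stop_at:
                    results.append(key)
                    if len(results) == max_results:
                        return results
        return results
```

With hash-based partitioning the same range would be scattered across every node, which is why the other two options on the slide need a master or an index layered over the key-value store.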
![Page 32: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/32.jpg)
Modification
● Insert / update
● Remove
● Single column or batch
● Specify W, number of nodes to wait for
![Page 33: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/33.jpg)
Thrift
struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}
struct SuperColumn {
  1: binary name,
  2: list<Column> columns,
}
Column get_column(table, key, column_path, block_for=1)
list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)
void insert(table, key, column_path, value, timestamp, block_for=0)
void remove(tablename, key, column_path_or_parent, timestamp)
![Page 34: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/34.jpg)
Honestly, Thrift kinda sucks
![Page 35: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/35.jpg)
Example: a multiuser blog
Two queries
- the most recent posts belonging to a given blog, in reverse chronological order
- a single post and its comments, in chronological order
![Page 36: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/36.jpg)
First try
Row "JBE blog": supercolumns "Cassandra is teh awesome", "BASE FTW"
Row "Evan blog": supercolumns "I like kittens", "And Ruby"
Each supercolumn: post, comment, comment
<ColumnFamily
Type="Super"
CompareWith="TimeString"
CompareSubcolumnsWith="UUID"
Name="Blog"/>
![Page 37: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/37.jpg)
Second try
<ColumnFamily
CompareWith="UUIDType"
Name="Blog"/>
Blog rows:
Row "JBE blog": columns "Cassandra is teh awesome", "BASE FTW"
Row "Evan blog": columns "I like kittens", "And Ruby"
Comment rows:
Row "Cassandra is teh awesome": comment, comment
Row "BASE FTW": comment, comment
Row "I like kittens": comment, comment
Row "And Ruby": comment, comment
<ColumnFamily
CompareWith="UUIDType"
Name="Comment"/>
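The second design can be checked against the two queries from the example. In this sketch, integer ids stand in for the time-ordered UUID column names, the comment texts are placeholders, and the function names are hypothetical.

```python
# Integer ids stand in for time-based UUID column names (assumption).
blog = {      # ColumnFamily "Blog": row key = blog, column name = post id
    "JBE blog": {1: "Cassandra is teh awesome", 2: "BASE FTW"},
    "Evan blog": {3: "I like kittens", 4: "And Ruby"},
}
comment = {   # ColumnFamily "Comment": row key = post id, column name = comment id
    1: {10: "first comment", 11: "second comment"},
    2: {12: "first comment", 13: "second comment"},
}


def recent_posts(blog_name, limit=10):
    """Query 1: the newest posts of one blog, newest first."""
    row = blog.get(blog_name, {})
    return [row[post_id] for post_id in sorted(row, reverse=True)[:limit]]


def post_with_comments(blog_name, post_id):
    """Query 2: one post plus its comments, oldest first."""
    comments = comment.get(post_id, {})
    return blog[blog_name][post_id], [comments[c] for c in sorted(comments)]
```

Each query reads a single row from one ColumnFamily, or one row from each, and relies only on the column ordering declared by `CompareWith`.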
![Page 38: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/38.jpg)
Roadmap
![Page 39: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/39.jpg)
Cassandra 0.3
● Remove support
● OPP / Range queries
● Test suite
● Workarounds for JDK bugs
● Rudimentary multi-datacenter support
![Page 40: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/40.jpg)
Cassandra 0.4
● Branched May 18
● Data file format change to support billions of rows per node instead of millions
● API changes (no more colon delimiters)
● Multi-table (keyspace) support
● LRU key cache
● fsync support
● Bootstrap
● Web interface
![Page 41: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/41.jpg)
Cassandra 0.5
● Bootstrap
● Load balancing
  ● Closely related to “bootstrap done right”
● Merkle tree repair
● Millions of columns per row
  ● This will require another data format change
● Multiget
● Callout support
![Page 42: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/42.jpg)
Users
Production: Facebook, RocketFuel
Production RSN: Digg, Rackspace
No date yet: IBM Research, Twitter
Evaluating: 50+ in #cassandra on freenode
![Page 43: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/43.jpg)
More
● Eventual consistency: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059
● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndPresentations
● #cassandra on irc.freenode.net
![Page 44: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006](https://reader031.vdocuments.us/reader031/viewer/2022022518/5b14ab737f8b9a397c8e4898/html5/thumbnails/44.jpg)
Cassandra