Transcript
Page 1: Big Data Grows Up - A (re)introduction to Cassandra

Big Data Grows Up
A (re)introduction to Cassandra

Robbie Strickland

Page 2: Big Data Grows Up - A (re)introduction to Cassandra

Who am I?

Robbie Strickland
Software Development Manager
The Weather Channel

[email protected]
@dont_use_twitter

Page 3: Big Data Grows Up - A (re)introduction to Cassandra

Who am I?

● Cassandra user/contributor since 2010
● … it was at release 0.5 back then
● 4 years? Oracle DBAs aren’t impressed
● Done lots of dumb stuff with Cassandra
● … and some really awesome stuff too

Page 4: Big Data Grows Up - A (re)introduction to Cassandra

Cassandra in 2010

Page 5: Big Data Grows Up - A (re)introduction to Cassandra

Cassandra in 2010

Page 6: Big Data Grows Up - A (re)introduction to Cassandra

Cassandra in 2014

Page 7: Big Data Grows Up - A (re)introduction to Cassandra

Why Cassandra?

It’s fast:

● No locks
● Tunable consistency
● Sequential R/W
● Decentralized
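A rough sketch of what "tunable consistency" buys you, assuming the usual overlap rule: a read sees the latest write whenever the replicas read plus the replicas written exceed the replication factor. The function names are made up for illustration and are not part of any driver.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that form a majority for the given replication factor."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


RF = 3
print(quorum(RF))                                          # 2
print(reads_see_latest_write(quorum(RF), quorum(RF), RF))  # True: QUORUM + QUORUM
print(reads_see_latest_write(1, 1, RF))                    # False: ONE + ONE
```

Lower levels trade that guarantee for latency, which is the "tunable" part.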

Page 8: Big Data Grows Up - A (re)introduction to Cassandra

Why Cassandra?

It scales (linearly):

● Multi data center
● No SPOF
● DHT
● Hadoop integration

Page 9: Big Data Grows Up - A (re)introduction to Cassandra

Why Cassandra?

It’s fault tolerant:

● Automatic replication
● Masterless
● Failed nodes replaced with ease

Page 10: Big Data Grows Up - A (re)introduction to Cassandra

… a lot in the last year (ish)

What’s different?

Page 11: Big Data Grows Up - A (re)introduction to Cassandra

What’s new?

● Virtual nodes
● O(n) data moved off-heap
● CQL3 (and defining schemas)
● Native protocol/driver
● Collections
● Lightweight transactions
● Compaction throttling that actually works

Page 12: Big Data Grows Up - A (re)introduction to Cassandra

What’s gone?

● Manual token management
● Supercolumns
● Thrift (if you use the native driver)
● Directly managing storage rows

Page 13: Big Data Grows Up - A (re)introduction to Cassandra

What’s still the same?

● Still not an RDBMS
● Still no joins (see above)
● Still no ad-hoc queries (see above again)
● Still requires a denormalized data model (^^)
● Still need to know what the heck you’re doing

Page 14: Big Data Grows Up - A (re)introduction to Cassandra

Linear scalability without the migraine

Token Management

Page 15: Big Data Grows Up - A (re)introduction to Cassandra

The old way
● 1 token per node
● Assigned manually
● Adding nodes == reassignment of all tokens
● Node rebuild heavily taxes a few nodes

[Diagram: cluster with no vnodes, ring of nodes A through F]
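To see why manual tokens hurt, here is a small sketch that assumes the RandomPartitioner token space of 0 to 2^127 and evenly spaced tokens. The helper names are hypothetical.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner token space (assumed here)


def balanced_tokens(node_count: int) -> list[int]:
    """Evenly spaced initial tokens, one per node."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def tokens_that_move(old_count: int, new_count: int) -> int:
    """How many existing nodes need a new token to stay balanced after growing."""
    old = balanced_tokens(old_count)
    new = set(balanced_tokens(new_count))
    return sum(1 for token in old if token not in new)


print(tokens_that_move(6, 7))   # 5: every node except the one at token 0
print(tokens_that_move(6, 12))  # 0: doubling keeps the old tokens in place
```

Doubling the cluster was the usual way around the reshuffle, which is not much of a capacity plan.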

Page 16: Big Data Grows Up - A (re)introduction to Cassandra

… enter Vnodes
● n tokens per node
● Assigned magically
● Adding nodes == painless
● Node rebuild distributed across many nodes

[Diagram: cluster with vnodes, ring labels A through N]

Page 17: Big Data Grows Up - A (re)introduction to Cassandra

Node rebuild without Vnodes

Page 18: Big Data Grows Up - A (re)introduction to Cassandra

Node rebuild with Vnodes
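The two rebuild pictures can be approximated with a toy ring. This is a simplified model, not Cassandra's streaming logic: tokens are random 64-bit numbers, and ring neighbours stand in for the replicas that would feed a rebuild. All names are hypothetical.

```python
import random


def build_ring(node_count: int, tokens_per_node: int, seed: int = 42):
    """Return the ring as (token, node) pairs sorted by token."""
    rng = random.Random(seed)
    ring = [(rng.getrandbits(64), f"node{n}")
            for n in range(node_count)
            for _ in range(tokens_per_node)]
    return sorted(ring)


def rebuild_sources(ring, node: str) -> set:
    """Nodes adjacent on the ring to any token owned by `node`."""
    sources = set()
    for i, (_, owner) in enumerate(ring):
        if owner == node:
            sources.add(ring[i - 1][1])                 # predecessor (wraps around)
            sources.add(ring[(i + 1) % len(ring)][1])   # successor (wraps around)
    sources.discard(node)
    return sources


# One token per node: only the two ring neighbours can help.
print(len(rebuild_sources(build_ring(12, 1), "node0")))
# 256 tokens per node: nearly every other node shares a boundary.
print(len(rebuild_sources(build_ring(12, 256), "node0")))
```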

Page 19: Big Data Grows Up - A (re)introduction to Cassandra

because the JVM sometimes sucks

Going Off-heap

Page 20: Big Data Grows Up - A (re)introduction to Cassandra

Why go off-heap

● GC overhead
● JVM no good with big heap sizes
● GC overhead
● GC overhead
● GC overhead

Page 21: Big Data Grows Up - A (re)introduction to Cassandra

O(n) data structures

● Row cache
● Bloom filters
● Compression offsets
● Partition summary

… all these are moved off-heap
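"O(n)" here means these structures grow with the amount of data on the node. A sketch for Bloom filters, using the standard sizing formula for an optimally tuned filter. The 1% false-positive rate is an illustrative assumption, not a setting taken from the talk.

```python
import math


def bloom_filter_bits(key_count: int, false_positive_rate: float) -> int:
    """Bits needed by an optimally sized Bloom filter (standard formula)."""
    return math.ceil(-key_count * math.log(false_positive_rate) / math.log(2) ** 2)


# Memory grows linearly with the number of keys.
for keys in (1_000_000, 10_000_000, 100_000_000):
    mib = bloom_filter_bits(keys, 0.01) / 8 / 2 ** 20
    print(f"{keys:>11,} keys -> {mib:,.1f} MiB")
```

Ten times the keys costs ten times the memory, which is exactly the kind of growth you do not want on a garbage-collected heap.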

Page 22: Big Data Grows Up - A (re)introduction to Cassandra

New memory allocation

[Diagram]
Native memory: row cache, Bloom filters, compression offsets, partition summary
JVM heap: partition key cache

Page 23: Big Data Grows Up - A (re)introduction to Cassandra

Or, how to build a killer data store without a crappy interface

Death of a (Thrift) Salesman

Page 24: Big Data Grows Up - A (re)introduction to Cassandra

Reasons not to ditch Thrift

● Lots of client libraries still use it
● You finally got it installed
● You didn’t know there was another choice
● It sucks less than many alternatives

Page 25: Big Data Grows Up - A (re)introduction to Cassandra

… in spite of all those benefits, you really should ditch Thrift because:

● It requires your entire result set to fit into RAM on both client and server

● The native protocol is better, faster, and supports all the new features

● Thrift-based client libraries are always a step behind

● It’s going away eventually

Page 26: Big Data Grows Up - A (re)introduction to Cassandra

… and did I mention ...

It requires your entire result set to fit into RAM

on both client and server!!!

Page 27: Big Data Grows Up - A (re)introduction to Cassandra

Requesting too much data

Page 28: Big Data Grows Up - A (re)introduction to Cassandra

really catchy tag line here

Going Native

Page 29: Big Data Grows Up - A (re)introduction to Cassandra

Native protocol

● It’s binary, making it lighter weight
● It supports cursors (FTW!)
● It supports prepared statements
● Cluster awareness built-in
● Either synchronous or asynchronous ops
● Only supports CQL-based operations
● Can be used side-by-side with Thrift
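Cursors are what fix the "entire result set in RAM" problem. The sketch below is a plain Python model of the idea, with a list standing in for the server. The function names are hypothetical and this is not the driver's API.

```python
from typing import Iterator


def fetch_page(rows: list, start: int, page_size: int) -> list:
    """Stand-in for one server round trip returning at most page_size rows."""
    return rows[start:start + page_size]


def paged_query(rows: list, page_size: int) -> Iterator:
    """Yield rows one page at a time so only one page is held by the client."""
    start = 0
    while True:
        page = fetch_page(rows, start, page_size)
        if not page:
            return
        yield from page
        start += len(page)


server_rows = list(range(10_000))  # pretend this lives on the server
total = sum(1 for _ in paged_query(server_rows, page_size=500))
print(total)  # 10000
```

The caller iterates as if the result were one big list, but memory use is bounded by the page size.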

Page 30: Big Data Grows Up - A (re)introduction to Cassandra

Native drivers

from DataStax:
● Java
● C#
● Python

… other community supported drivers available

Page 31: Big Data Grows Up - A (re)introduction to Cassandra

Native query example
// Connect first; statements can only be prepared on an open session
val cluster = Cluster.builder().addContactPoints(host1, host2, host3).build()
val session = cluster.connect()
// Prepare once, then bind and execute as often as needed
val insert = session.prepare("INSERT INTO myKsp.myTable (myKey, col1, col2) VALUES (?,?,?)")
val select = session.prepare("SELECT * FROM myKsp.myTable WHERE myKey = ?")
session.execute(insert.bind(myKey, col1, col2))
val result = session.execute(select.bind(myKey))

Page 32: Big Data Grows Up - A (re)introduction to Cassandra

Or, how to make Cassandra more awesome while simultaneously irritating early adopters

Wait, was that SQL?!!

Page 33: Big Data Grows Up - A (re)introduction to Cassandra

Introducing CQL3

● Because the first two attempts sucked
● Stands for “Cassandra Query Language”
● Looks a heck of a lot like SQL
● … but isn’t
● Substantially lowers the learning curve
● … but also makes it easier to screw up
● An abstraction over the storage rows

Page 34: Big Data Grows Up - A (re)introduction to Cassandra

Storage rows
[default@unknown] create keyspace Library;
[default@unknown] use Library;
[default@Library] create column family Books
... with comparator=UTF8Type
... and key_validation_class=UTF8Type
... and default_validation_class=UTF8Type;
[default@Library] set Books['Patriot Games']['author'] = 'Tom Clancy';
[default@Library] set Books['Patriot Games']['year'] = '1987';
[default@Library] list Books;

RowKey: Patriot Games
=> (name=author, value=Tom Clancy, timestamp=1393102991499000)
=> (name=year, value=1987, timestamp=1393103015955000)

Page 35: Big Data Grows Up - A (re)introduction to Cassandra

Storage rows - composites
[default@Library] create column family Authors
... with key_validation_class=UTF8Type
... and comparator='CompositeType(LongType,UTF8Type,UTF8Type)'
... and default_validation_class=UTF8Type;
[default@Library] set Authors['Tom Clancy']['1987:Patriot Games:publisher'] = 'Putnam';
[default@Library] set Authors['Tom Clancy']['1987:Patriot Games:ISBN'] = '0-399-13241-4';
[default@Library] set Authors['Tom Clancy']['1993:Without Remorse:publisher'] = 'Putnam';
[default@Library] set Authors['Tom Clancy']['1993:Without Remorse:ISBN'] = '0-399-13825-0';
[default@Library] list Authors;

RowKey: Tom Clancy
=> (name=1987:Patriot Games:ISBN, value=0-399-13241-4, timestamp=1393104011458000)
=> (name=1987:Patriot Games:publisher, value=Putnam, timestamp=1393103948577000)
=> (name=1993:Without Remorse:ISBN, value=0-399-13825-0, timestamp=1393104109214000)
=> (name=1993:Without Remorse:publisher, value=Putnam, timestamp=1393104083773000)

Page 36: Big Data Grows Up - A (re)introduction to Cassandra

CQL - simple intro
cqlsh> CREATE KEYSPACE Library WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1};
cqlsh> use Library;
cqlsh:library> CREATE TABLE Books (
           ... title varchar,
           ... author varchar,
           ... year int,
           ... PRIMARY KEY (title)
           ... );
cqlsh:library> INSERT INTO Books (title, author, year) VALUES ('Patriot Games', 'Tom Clancy', 1987);
cqlsh:library> INSERT INTO Books (title, author, year) VALUES ('Without Remorse', 'Tom Clancy', 1993);

Page 37: Big Data Grows Up - A (re)introduction to Cassandra

CQL - simple intro

Storage rows:

Page 38: Big Data Grows Up - A (re)introduction to Cassandra

CQL - composite key
CREATE TABLE Authors (
  name varchar,
  year int,
  title varchar,
  publisher varchar,
  ISBN varchar,
  PRIMARY KEY (name, year, title)
)

Page 39: Big Data Grows Up - A (re)introduction to Cassandra

CQL - composite key

Storage rows:
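How the composite-key table maps onto storage rows can be sketched in a few lines. This is a simplified model under stated assumptions: the first key column becomes the row key, the remaining key columns become a prefix on every cell name, and details such as the marker cell CQL3 writes for each row are left out. The function name is hypothetical.

```python
from collections import defaultdict


def to_storage_rows(cql_rows, partition_key, clustering_columns):
    """Flatten CQL rows into {partition key: sorted [(cell name, value)]}."""
    storage = defaultdict(list)
    for row in cql_rows:
        prefix = tuple(row[c] for c in clustering_columns)
        for column, value in row.items():
            if column == partition_key or column in clustering_columns:
                continue  # key columns live in the row key or the cell-name prefix
            storage[row[partition_key]].append((prefix + (column,), value))
    return {key: sorted(cells) for key, cells in storage.items()}


authors = [
    {"name": "Tom Clancy", "year": 1993, "title": "Without Remorse",
     "publisher": "Putnam", "ISBN": "0-399-13825-0"},
    {"name": "Tom Clancy", "year": 1987, "title": "Patriot Games",
     "publisher": "Putnam", "ISBN": "0-399-13241-4"},
]

for key, cells in to_storage_rows(authors, "name", ["year", "title"]).items():
    print("RowKey:", key)
    for name, value in cells:
        print("  ", ":".join(str(part) for part in name), "=>", value)
```

Two CQL rows collapse into one wide storage row with four cells, sorted by year and then title, the same shape as the hand-built Authors column family.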

Page 40: Big Data Grows Up - A (re)introduction to Cassandra

Keys and Filters

● Ad hoc queries are NOT supported
● Query by key
● Key must include all potential filter columns
● Must include partition key in filter
● Subsequent filters must be in order
● Only last filter can be a range
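The filter rules can be written down as a small checker. This is a sketch of the rules as stated on the slide, not of Cassandra's query planner. It assumes a single-column partition key that must be matched exactly, and the "eq" / "range" labels are made up for illustration.

```python
def query_allowed(partition_key, clustering_columns, filters):
    """Check a WHERE clause against the rules above.

    filters maps column name -> "eq" or "range".
    """
    key_columns = [partition_key] + list(clustering_columns)
    if any(column not in key_columns for column in filters):
        return False  # filter column is not part of the key
    if filters.get(partition_key) != "eq":
        return False  # partition key must be present (exact match assumed)
    used = [c for c in key_columns if c in filters]
    if used != key_columns[:len(used)]:
        return False  # clustering columns must be used in key order
    return all(filters[c] == "eq" for c in used[:-1])  # only the last may be a range


# PRIMARY KEY (year, author, title)
key = ("year", ["author", "title"])
print(query_allowed(*key, {"year": "eq"}))                                    # True
print(query_allowed(*key, {"year": "eq", "author": "eq", "title": "range"}))  # True
print(query_allowed(*key, {"year": "eq", "title": "eq"}))                     # False: skips author
print(query_allowed(*key, {"author": "eq"}))                                  # False: no partition key
```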

Page 41: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table
CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (title)
)

Page 42: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table
CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (author, title)
)

Page 43: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table
CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (author, year)
)

Page 44: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table
CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (year, author)
)

Page 45: Big Data Grows Up - A (re)introduction to Cassandra

Secondary Indexes

● Allows query-by-value
● CREATE INDEX myIdx ON myTable (myCol)
● Works well on low cardinality fields
● Won’t scale for high cardinality fields
● Don’t overuse it -- not a quick fix for a bad data model

Page 46: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table
CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (author)
)
CREATE INDEX Books_year ON Books(year)

Page 47: Big Data Grows Up - A (re)introduction to Cassandra

Composite Partition Keys

● PRIMARY KEY((year, author), title)
● Creates a more granular shard key
● Can be useful to make certain queries more efficient, or to better distribute data
● Updates sharing a partition key are atomic and isolated

Page 48: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table
CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY ((year, author), title)
)

Page 49: Big Data Grows Up - A (re)introduction to Cassandra

Example - Books table
CREATE TABLE Books (
  title varchar,
  author varchar,
  year int,
  PRIMARY KEY (year, author, title)
)

Page 50: Big Data Grows Up - A (re)introduction to Cassandra

denormalization done well

Collections

Page 51: Big Data Grows Up - A (re)introduction to Cassandra

Supported types

● Sets - ordered naturally
● Lists - ordered by index
● Maps - key/value pairs

Page 52: Big Data Grows Up - A (re)introduction to Cassandra

Caveats

● Max 64k items in a collection
● Max 64k size per item
● Collections are read in their entirety, so keep them small

Page 53: Big Data Grows Up - A (re)introduction to Cassandra

Sets

Page 54: Big Data Grows Up - A (re)introduction to Cassandra

Sets

Set name

Item value

Page 55: Big Data Grows Up - A (re)introduction to Cassandra

Lists

Page 56: Big Data Grows Up - A (re)introduction to Cassandra

Lists

List name Ordering meta data

List item value

Page 57: Big Data Grows Up - A (re)introduction to Cassandra

Maps

Page 58: Big Data Grows Up - A (re)introduction to Cassandra

Maps

Map name

Key Value
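The three layouts labeled on these slides can be modeled as cells. This is an illustrative sketch only: cell names are shown as tuples, a list position stands in for the time-based ordering metadata, and the column names are invented.

```python
def set_cells(name, items):
    """Each set item becomes part of the cell name; the cell value is empty."""
    return sorted(((name, item), "") for item in items)


def list_cells(name, items):
    """Each list item is a cell value, named by ordering metadata."""
    return [((name, position), item) for position, item in enumerate(items)]


def map_cells(name, mapping):
    """Each map key becomes part of the cell name; the map value is the cell value."""
    return sorted(((name, key), value) for key, value in mapping.items())


print(set_cells("tags", {"thriller", "fiction"}))
print(list_cells("chapters", ["Prologue", "One"]))
print(map_cells("prices", {"hardcover": "24.95", "paperback": "7.99"}))
```

Because sets and maps keep the item or key in the sorted cell name, they come back in natural order, while lists come back in insertion order.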

Page 59: Big Data Grows Up - A (re)introduction to Cassandra

(tracing on)

TRON

Page 60: Big Data Grows Up - A (re)introduction to Cassandra

Using tracing

● In cqlsh, “tracing on”
● … enjoy!

Page 61: Big Data Grows Up - A (re)introduction to Cassandra

Example

1393126200000

Page 62: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern
CREATE TABLE WorkQueue (
  name varchar,
  time bigint,
  workItem varchar,
  PRIMARY KEY (name, time)
);

… do a bunch of inserts ...
SELECT * FROM WorkQueue WHERE name='ToDo' ORDER BY time ASC;
DELETE FROM WorkQueue WHERE name='ToDo' AND time=[some_time];

Page 63: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern - enqueue

Page 64: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern - dequeue

Page 65: Big Data Grows Up - A (re)introduction to Cassandra

Antipattern

20k tombstones!! 13ms of 17ms spent reading tombstones
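Why the queue pattern degrades can be shown with a toy partition. It is a simplified model: a delete marks a cell as a tombstone instead of removing it, and every read of the head has to walk past those markers. The class name is hypothetical and the size is scaled down from the 20k on the slide.

```python
class QueuePartition:
    """One partition of WorkQueue: cells sorted by time, deletes leave tombstones."""

    def __init__(self):
        self.cells = []  # [time, work_item or None]; None marks a tombstone

    def enqueue(self, time, work_item):
        self.cells.append([time, work_item])  # times arrive in increasing order here

    def dequeue(self):
        """Return (item, tombstones scanned) for the oldest live item."""
        scanned = 0
        for cell in self.cells:
            if cell[1] is None:
                scanned += 1  # must read past every tombstone before the head
                continue
            item, cell[1] = cell[1], None  # DELETE writes a tombstone, not a removal
            return item, scanned
        return None, scanned


queue = QueuePartition()
for t in range(2_001):
    queue.enqueue(t, f"job-{t}")
for _ in range(2_000):
    queue.dequeue()
print(queue.dequeue())  # ('job-2000', 2000)
```

The last dequeue returns one row but scans 2,000 dead ones first, and the cost keeps growing until the tombstones are compacted away.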

Page 66: Big Data Grows Up - A (re)introduction to Cassandra

(no it’s not ACID)

Lightweight Transactions

Page 67: Big Data Grows Up - A (re)introduction to Cassandra

Primer

● Supports basic Compare-and-Set ops
● Provides linearizable consistency
● … aka serial isolation
● Uses “Paxos light” under the hood
● Still expensive -- four round trips!
● For most cases quorum reads/writes will be sufficient

Page 68: Big Data Grows Up - A (re)introduction to Cassandra

Usage
INSERT INTO Users (login, name)
VALUES ('rs_atl', 'Robbie Strickland')
IF NOT EXISTS;

UPDATE Users
SET password='super_secure_password'
WHERE login='rs_atl'
IF reset_token='some_reset_token';
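The semantics of the two statements, minus the distributed part, can be sketched in memory. This single-process model has no Paxos and no replicas. It only shows the compare-and-set behavior: the write is applied when the condition holds and reported as not applied otherwise. The class and method names are hypothetical.

```python
class UsersTable:
    """In-memory stand-in for a table supporting compare-and-set writes."""

    def __init__(self):
        self.rows = {}

    def insert_if_not_exists(self, login, **columns):
        """Mimics INSERT ... IF NOT EXISTS; returns whether the write was applied."""
        if login in self.rows:
            return False
        self.rows[login] = dict(columns)
        return True

    def update_if(self, login, expected, **changes):
        """Mimics UPDATE ... IF col = value; applies only when every expectation holds."""
        row = self.rows.get(login)
        if row is None or any(row.get(c) != v for c, v in expected.items()):
            return False
        row.update(changes)
        return True


users = UsersTable()
print(users.insert_if_not_exists("rs_atl", name="Robbie Strickland"))  # True
print(users.insert_if_not_exists("rs_atl", name="Someone Else"))       # False
print(users.update_if("rs_atl", {"reset_token": "some_reset_token"},
                      password="super_secure_password"))               # False: no token set
```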

Page 69: Big Data Grows Up - A (re)introduction to Cassandra

Other cool stuff

● Triggers (experimental)
● Batching multiple requests
● Leveled compaction
● Configuration via CQL
● Gossip-based rack/DC configuration

Page 70: Big Data Grows Up - A (re)introduction to Cassandra

Thank you!

Robbie Strickland
Software Development Manager
The Weather Channel

[email protected]
@dont_use_twitter

