Cassandra Community Webinar: Back to Basics with CQL3

Back to Basics with CQL3

Matt Overstreet, OpenSource Connections

Outline

• Overview
• Architecture
• Data Modeling
• Good At/Bad At
• Using Cassandra


Outline



• What is Big Data?
• How does Cassandra fit?

What is Big Data?

• The three V’s (and a C)
  o Velocity
  o Volume
  o Variety
  o Complexity

What is Big Data?

• Brewer’s CAP theorem
  o Consistency - all nodes have same world view
  o Availability - requests can be serviced
  o Partition tolerance - network/machine failure
  o Can’t have all 3 -- Pick 2!
• Examples
  o MySQL – Consistent, Available
  o HBase – Consistent, Partition Tolerant
  o Cassandra – Available, Partition Tolerant
    – and “Tunably Consistent”!
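One way to read "tunably consistent" is through the replica-overlap rule: with N replicas, a read that waits for R replicas is guaranteed to see the latest write acknowledged by W replicas whenever R + W > N. The sketch below is only an illustration of that arithmetic; the function names are hypothetical, and ONE, QUORUM, and ALL are used as examples of Cassandra consistency levels.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas."""
    return n // 2 + 1


def replicas_overlap(n: int, r: int, w: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    N = 3  # replication factor, chosen only for illustration
    levels = {"ONE": 1, "QUORUM": quorum(N), "ALL": N}
    for write_name, w in levels.items():
        for read_name, r in levels.items():
            verdict = "overlap" if replicas_overlap(N, r, w) else "may be stale"
            print(f"write={write_name:<6} read={read_name:<6} -> {verdict}")
```

Lower levels favor availability and latency; higher levels favor consistency. The choice can be made per request.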


What is Big Data?

• Common theme: Denormalize everything!
  o What’s that?
    • JOIN all the tables in the database...
    • … well not all the tables
  o Why?
    • You can shard the database at any point
    • All related data is co-located
• What this means for you
  o No joins
  o No transactions - potential for inconsistency
  o Vastly simplified querying
  o No data-modeling -- Instead, query-modeling
  o “Infinite and easy” scaling potential


How Does Cassandra Fit?

• No single point of failure
• Optimized for writes, still good with reads
• Can decide between Consistency and Availability concerns


Outline



• Ring architecture
• Data partitioning
• Operations
  o Writes
  o Reads

Ring Architecture

• No single point of failure
• Nodes talk via gossip
• Democratic - all nodes are equal


Data Partitioning

Original partitioning method.


Data Partitioning

Flexible partitioning with virtual nodes.
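The two partitioning slides can be sketched as a token ring. In this toy version each node owns one token (the original method) or several tokens (virtual nodes), and a row key belongs to the first token at or after its hash, wrapping around the ring. Everything here is an assumption made for illustration: the `Ring` class is hypothetical, MD5 stands in for the partitioner, and tokens are derived from node names so the example is deterministic, whereas a real cluster assigns tokens differently.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key onto the ring (MD5 is a stand-in for the partitioner)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, vnodes_per_node=1):
        # Map each token to the node that owns it.
        self._owner = {}
        for node in nodes:
            for i in range(vnodes_per_node):
                self._owner[token(f"{node}#{i}")] = node
        self._tokens = sorted(self._owner)

    def replicas(self, row_key, replication_factor=1):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_left(self._tokens, token(row_key))
        found = []
        for step in range(len(self._tokens)):
            t = self._tokens[(start + step) % len(self._tokens)]
            node = self._owner[t]
            if node not in found:
                found.append(node)
            if len(found) == replication_factor:
                break
        return found


if __name__ == "__main__":
    single = Ring(["node1", "node2", "node3"], vnodes_per_node=1)
    virtual = Ring(["node1", "node2", "node3"], vnodes_per_node=8)
    for key in ["Nate", "Patricia", "Eric"]:
        print(key, single.replicas(key, 2), virtual.replicas(key, 2))
```

With more tokens per node, each node owns many small ranges instead of one large one, which is what makes adding or removing a node less disruptive.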


Operations: Writes


Requests are sent out to the owning nodes and their replicas.

Operations: Reads


The coordinator node reaches out to the relevant replicas.

Outline



• Internals
• Cassandra Query Language
• Modeling Strategy
• Example

C* Data Model

[Figure: a Keyspace contains Column Families; each Column Family holds rows identified by a Row Key; each row holds Columns, and each Column has a Column Name, a Column Value (or Tombstone), a Timestamp, and a Time-to-live.]

C* Data Model


● Row Key, Column Name, and Column Value have types
● Column Name has a comparator
● Row Key has a partitioner
● Rows can have any number of columns - even in the same column family
● Rows can have many columns
● Column Values can be omitted
● Time-to-live is useful!
● Tombstones
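The column structure above (name, value or tombstone, timestamp, time-to-live) can be sketched as a small record with a last-write-wins merge. This is a simplified model rather than Cassandra's implementation: the `Column` class and `reconcile` function are hypothetical, timestamps are plain integers in microseconds, expiry is measured from the write timestamp, and the tie-breaking rule (a tombstone wins an exact timestamp tie) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: Optional[bytes]      # None marks a tombstone (a deleted column)
    timestamp: int              # write time in microseconds
    ttl: Optional[int] = None   # time-to-live in seconds, if any

    def is_live(self, now_us: int) -> bool:
        """A column is readable unless it is a tombstone or has expired."""
        if self.value is None:
            return False
        if self.ttl is None:
            return True
        return now_us < self.timestamp + self.ttl * 1_000_000


def reconcile(a: Column, b: Column) -> Column:
    """Last write wins: keep the version with the newer timestamp."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Assumed tie-break for this sketch: prefer the tombstone.
    return b if b.value is None else a


if __name__ == "__main__":
    old = Column("firstname", b"Jon", timestamp=1)
    new = Column("firstname", b"John", timestamp=2)
    deleted = Column("firstname", None, timestamp=3)
    print(reconcile(old, new).value)
    print(reconcile(new, deleted).is_live(now_us=10))
```

Because a delete is itself a timestamped write, replicas can merge versions in any order and still agree on the result.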

C* Data Model: Writes

● Insert into MemTable
● Dump to CommitLog
● No read
● Very Fast!
● Blocks on CPU before I/O!

[Figure: write path through the MemTable and CommitLog, shown alongside the Row Cache, Key Cache, Bloom Filter, and SSTables.]
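The write path above can be sketched as a toy node that appends to a commit log, updates an in-memory memtable, and flushes the memtable to an immutable, sorted SSTable once it grows past a threshold. The `ToyNode` class, its in-memory lists, and the flush threshold are all hypothetical stand-ins; the point is only that a write never reads existing data.

```python
class ToyNode:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only; stands in for the on-disk log
        self.memtable = {}     # latest (value, timestamp) per (row_key, column)
        self.sstables = []     # immutable, sorted snapshots of flushed memtables
        self.flush_threshold = flush_threshold

    def write(self, row_key, column, value, timestamp):
        # Record the mutation for durability, then apply it in memory.
        self.commit_log.append((row_key, column, value, timestamp))
        self.memtable[(row_key, column)] = (value, timestamp)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs the log for recovery.
        self.commit_log.clear()


if __name__ == "__main__":
    node = ToyNode(flush_threshold=2)
    node.write("Nate", "2013-07-19T09:20", "Wonderful morning.", 1)
    node.write("Nate", "2013-07-19T09:21", "Oops.", 2)
    print(len(node.sstables), len(node.memtable), len(node.commit_log))
```

Nothing in `write` looks up existing data, which matches the "No read" bullet: the cost of a write is an append and a dictionary update.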


C* Data Model: Reads

● Get values from MemTable
● Get values from Row Cache if present
● Otherwise check Bloom Filter to find appropriate SSTables
● Check Key Cache for fast SSTable search
● Get values from SSTables
● Repopulate Row Cache
● Super fast column retrieval
● Fast row slicing

[Figure: read path through the MemTable, Row Cache, Bloom Filter, Key Cache, and SSTables.]
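The read steps can be sketched as a merge across the row cache, the SSTables whose bloom filter may contain the key, and the memtable, keeping the newest timestamp per column. This is a simplified illustration under stated assumptions: `BloomFilter`, `SSTable`, and `read_row` are hypothetical names, the key cache step is omitted, and the row cache here holds only the merged SSTable view of a row.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    def __init__(self, rows):
        self.rows = rows  # {row_key: {column: (value, timestamp)}}
        self.bloom = BloomFilter()
        for row_key in rows:
            self.bloom.add(row_key)


def _merge(target, columns):
    """Fold columns into target, keeping the newest timestamp per column."""
    for column, cell in columns.items():
        if column not in target or cell[1] > target[column][1]:
            target[column] = cell


def read_row(row_key, memtable, row_cache, sstables):
    if row_key in row_cache:
        merged = dict(row_cache[row_key])
    else:
        merged = {}
        for sstable in sstables:
            if sstable.bloom.might_contain(row_key):  # skip SSTables that cannot match
                _merge(merged, sstable.rows.get(row_key, {}))
        row_cache[row_key] = dict(merged)             # repopulate the row cache
    _merge(merged, memtable.get(row_key, {}))         # newest writes live in memory
    return {column: value for column, (value, _) in sorted(merged.items())}
```

A call such as `read_row("Nate", memtable, {}, [older, newer])` returns one value per column, taken from whichever source holds the newest timestamp.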

Internals: Twitter Example

• 4 ColumnFamilies
  o followers
  o following
  o tweets
  o timeline



• Nate follows Patricia
  o SET followers[Patricia][Nate] = '';
  o SET following[Nate][Patricia] = '';
  o storing data in column names (not values)
  o denormalized, redundant!
• Get all Patricia’s followers
  o GET followers[Patricia]
  o => Nate,Eric,Scott,Matt,Doug,Kate
  o No JOIN!
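The follower example can be mimicked with two nested dictionaries, one per column family, where the data lives in the column names and the values are empty. This is only an in-memory analogy with hypothetical helper names; it shows why each follow is written twice and why the lookup needs no JOIN.

```python
from collections import defaultdict

# One dict per column family: {row_key: {column_name: column_value}}
followers = defaultdict(dict)
following = defaultdict(dict)


def follow(follower, followee):
    """Write the relationship twice, once for each query we expect to run."""
    followers[followee][follower] = ""   # "who follows followee?"
    following[follower][followee] = ""   # "whom does follower follow?"


def get_followers(user):
    """Single-row lookup; column names come back in sorted order."""
    return sorted(followers.get(user, {}))


if __name__ == "__main__":
    for name in ["Nate", "Eric", "Scott", "Matt", "Doug", "Kate"]:
        follow(name, "Patricia")
    print(get_followers("Patricia"))
    print(list(following["Nate"]))
```

The redundant write is the price of denormalization: each question is answered by reading one row from one column family.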


Internals: Twitter Example

• Nate tweets
  o SET tweets[Nate][2013-07-19 T 09:20] = “Wonderful morning. This coffee is great.”
  o SET tweets[Nate][2013-07-19 T 09:21] = “Oops, smoke is coming out of the SQL server!”
  o SET tweets[Nate][2013-07-19 T 09:51] = “Now my coffee is cold :-(”
• Get Nate’s tweets
  o GET tweets[Nate]
  o …(what you’d expect)...
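Because the tweet timestamps are column names, the row stays ordered by time and a time range becomes a contiguous slice. The sketch below models one wide row with a sorted list of column names; the `WideRow` class is hypothetical, and the timestamps are written as ISO-style strings so that string order matches time order.

```python
import bisect


class WideRow:
    """One row whose columns are kept sorted by column name."""

    def __init__(self):
        self._names = []    # sorted column names
        self._values = {}   # column name -> value

    def set(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, finish):
        """Return columns with start <= name <= finish, in name order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


if __name__ == "__main__":
    tweets_nate = WideRow()
    # Inserted out of order on purpose; the row keeps them sorted.
    tweets_nate.set("2013-07-19T09:51", "Now my coffee is cold :-(")
    tweets_nate.set("2013-07-19T09:20", "Wonderful morning. This coffee is great.")
    tweets_nate.set("2013-07-19T09:21", "Oops, smoke is coming out of the SQL server!")
    for when, text in tweets_nate.slice("2013-07-19T09:21", "2013-07-19T09:59"):
        print(when, text)
```

The slice is two binary searches plus a sequential read, which is the access pattern the "fast row slicing" bullet refers to.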


CQL (Cassandra Query Language)

CREATE TABLE users (
  id timeuuid PRIMARY KEY,
  lastname varchar,
  firstname varchar,
  dateOfBirth timestamp
);



INSERT INTO users (id, lastname, firstname, dateofbirth)
  VALUES (now(), 'Berryman', 'John', '1975-09-15');




UPDATE users SET firstname = 'John'
  WHERE id = f74c0b20-0862-11e3-8cf6-b74c10b01fc6;




SELECT dateofbirth, firstname, lastname FROM users;

 dateofbirth              | firstname | lastname
--------------------------+-----------+----------
 1975-09-15 00:00:00-0400 | John      | Berryman


The CQL/Cassandra Mapping

CREATE TABLE employees (
  company text,
  name text,
  age int,
  role text,
  PRIMARY KEY (company, name)
);

 company | name | age | role
---------+------+-----+------
 OSC     | eric |  38 | ceo
 OSC     | john |  37 | dev
 RKG     | anya |  29 | lead
 RKG     | ben  |  27 | dev
 RKG     | chad |  35 | ops
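A rough way to picture the mapping: the first primary-key column (company) becomes the row key, and the second (name) is folded into each column name, so all employees of one company share a single wide row. The sketch below is an approximation under stated assumptions: the function name is hypothetical, composite column names are written as "name:column" strings, and details of the real storage format are left out.

```python
cql_rows = [
    {"company": "OSC", "name": "eric", "age": 38, "role": "ceo"},
    {"company": "OSC", "name": "john", "age": 37, "role": "dev"},
    {"company": "RKG", "name": "anya", "age": 29, "role": "lead"},
    {"company": "RKG", "name": "ben", "age": 27, "role": "dev"},
    {"company": "RKG", "name": "chad", "age": 35, "role": "ops"},
]


def to_storage_rows(rows, partition_key, clustering_key):
    """Group CQL rows into one wide row per partition key value."""
    storage = {}
    for row in rows:
        wide_row = storage.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue
            # The clustering value is folded into the column name.
            wide_row[f"{row[clustering_key]}:{column}"] = value
    return storage


if __name__ == "__main__":
    for row_key, columns in to_storage_rows(cql_rows, "company", "name").items():
        print(row_key, columns)
```

Seen this way, a query restricted to one company touches a single wide row, and rows within it come back ordered by name.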

Modeling Strategy

• Don’t think about the data structure
• Do think of the questions you’ll ask
• Consider efficient operations for Cassandra
  o Writing (4K writes per second per core)
  o Retrieving a row
  o Retrieving a row slice
  o Retrieving in natural order (which you control)
• Write the data in the way you will query it
• Disk space is cheap
• Separate read-heavy and write-heavy tasks
  o Make wise use of caches


Modeling Strategy: Anti-Patterns

• Read-then-write
• Heavy deletes
  o Scatters dead columns throughout SSTables
  o Won’t be corrected until first compaction after gc_grace_seconds (10 days)
• Distributed queue
• JOIN-like behavior
• Super wide-row sneak attack (>2B columns)


QUESTIONS?

