cassandra data modeling

CASSANDRA DATA MODELING

INTRODUCTION

A little bit about me, a little bit more about AddThis

WHY CASSANDRA?▸ Scalability▸ Fault Tolerant

▸ Optimized for Big Data▸ Queryable

QUERYABLE, NOW WE'RE TALKIN!

STORING DATA IN CASSANDRA

▸ Before we can talk about building a house, you have to know what kind of tools you have.▸ Before you can cook a meal,

you need to know what ingredients you have. ▸ Before you can data model with Cassandra,

you need to know what types of data you can store.

BASIC TYPES

CQL Type | Constants | Description -------- | --------- | ----------- ascii | strings | US-ASCII character string bigint | integers | 64-bit signed long blob | blob | Arbitrary bytes boolean | booleans | true or false counter | integers | Distributed counter value (64-bit long) decimal | ints/floats| Variable-precision decimal float | ints/floats| 32-bit floating point inet | strings | IP address string int | integers | 32-bit signed integer text | strings | UTF-8 encoded string timestamp| ints/strings | Date plus time timeuuid | uuids | Type 1 UUID only tuple | n/a | A group of 2-3 fields (Cassandra 2.1+) uuid | uuids | A UUID in standard UUID format varchar | strings | UTF-8 encoded string varint | integers | Arbitrary-precision integer

HOW CASSANDRA STORES DATATABLE CONSTRUCT

CREATE TABLE users ( user_id uuid, first_name text, last_name text, company text, PRIMARY KEY (user_id));

COLLECTIONS

SETSemails set<text>

--- insert set into table INSERT INTO users (user_id, first_name, last_name, emails) VALUES(uuid(), 'Ben', 'Knear', {'[email protected]', '[email protected]'});

--- add to set, even if set was never instantiated --- note, it will re-sort the collection after adding UPDATE users SET emails = emails + {'[email protected]'} WHERE user_id = X;

--- remove from set UPDATE users SET emails = emails - {'[email protected]'} WHERE user_id = X;

LISTSpriority_emails list<text>

--- insert list into table INSERT INTO users (user_id, first_name, last_name, priority_emails) VALUES(uuid(), 'Ben', 'Knear', ['[email protected]', '[email protected]']);

--- add to list, even if list was never instantiated UPDATE users SET priority_emails = priority_emails + ['[email protected]'] WHERE user_id = X;

--- remove from list UPDATE users SET priority_emails = priority_emails - ['[email protected]'] WHERE user_id = X;

MAPScontact_info map<text, text>

--- insert map into table INSERT INTO users (user_id, first_name, last_name, contact_info) VALUES(uuid(), 'Ben', 'Knear', { 'work_email' : '[email protected]', 'home_email' : '[email protected]' });

--- delete from a map DELETE contact_info['work_email'] FROM users WHERE user_id = X;

MAPScontact_info map<text, text>

--- add to map, even if map was never instantiated UPDATE users SET contact_info['other_email'] = '[email protected]' WHERE user_id = X;

--- remove from map UPDATE users SET contact_info = contact_info - ['[email protected]'] WHERE user_id = X;

JSON OR MAP

▸ Remember all values in a Map must be the same type▸ What will you do with the value?

▸ Remember: Values of items in collections are limited to 64K.

HOW CASSANDRA STORES DATARow Keys, Column Families

| row key | columns | |---------|----------------------------------------| | | "first_name" | "last_name" | "company" | | UUID | 'Ben' | 'Knear' | 'AddThis' | |---------|----------------------------------------|

CASSANDRA LIMITS

▸ Maximum number of columns per row is 2 billion.▸ Maximum size for the name of a column is 64 KB.▸ Maximum size for a value in a column is 2 GB.▸ Collection values may not be larger than 64 KB.

APPROACHING DATA MODEL

You must ask:▸ What do we want to store?

▸ What are the relationships within the data?▸ How do we plan to access it?

In a relational model, you would focus most on the questions one and two. But for Cassandra, you must focus most on the third.

You must ask:▸ What do we want to store?

▸ What are the relationships within the data?▸ How do we plan to access it?

In a relational model, you would focus most on the questions one and two. But for Cassandra, you must focus most on the third.

Cassandra data modeling starts with how will you use the data.

Denormalization

Optimizing the read performance of a database by adding redundant data or by grouping data. In Cassandra, this process is accomplished by

duplicating data in multiple tables, grouping data for queries.

So how will we get data from Cassandra?

CASSANDRA QUERYING

PRIMARY KEY

Defines a unique value to identify the row, and also drives partitioning.

When multiple columns are defined in the primary key, the first column defines the partition key, the rest are clustered columns.

Partitioning is important because rows and columns are grouped on nodes, and grouped data is read and written faster.

EXAMPLE

CREATE TABLE movies ( id uuid, name text, genre text, cast set<uuid>, company text, PRIMARY KEY (id));

Good if I only know the ID when I retrieve

COMPOSITE PRIMARY KEY ALTERNATIVE

CREATE TABLE movies ( id uuid, name text, genre text, cast set<uuid>, company text, PRIMARY KEY (genre, id));

Good if I know the genre and ID when I retrieve, or to get all of a genre

QUERYING EXAMPLESCREATE TABLE user_addresses ( state text, city text, username text, address text, PRIMARY KEY (state, city, username));

-- insert a valueINSERT INTO user_addresses (state, city, username, address) VALUES ('VA', 'Vienna', 'AddThis', 'Spring Hill Rd');

-- FAILURESSELECT * FROM user_addresses WHERE city = 'Vienna';SELECT * FROM user_addresses WHERE city = 'Vienna' AND username = 'AddThis';

-- SUCCESSESSELECT * FROM user_addresses WHERE state = 'VA';SELECT * FROM user_addresses WHERE state = 'VA' AND city = 'Vienna';SELECT * FROM user_addresses WHERE state = 'VA' AND city = 'Vienna' AND username = 'AddThis';

ERROR

Filtering by just clustering columns will give you this response:

Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to

execute this query despite the performance unpredictability, use ALLOW FILTERING

WORK AROUND

Expensive query, so use LIMIT if you must: SELECT * FROM user_addresses WHERE username = 'AddThis' LIMIT 10 ALLOW FILTERING;

At absolute worst (there are no addresses with username = 'AddThis'), it will have to look through the entire table.

COMPOSITE PARTITION KEYS

Composite partition keys will group the first value on the same node, though the second value may be on a different node.

CREATE TABLE cars ( lot_id int, make text, model text, color text, PRIMARY KEY ((lot_id, make), model) );

INSERT INTO cars (lot_id, make, model, color) VALUES (1, 'Ford', 'Explorer', 'Black'); INSERT INTO cars (lot_id, make, model, color) VALUES (1, 'Cadillac', 'CT6', 'Black'); INSERT INTO cars (lot_id, make, model, color) VALUES (2, 'BMW', 'M8', 'Red');

COMPOSITE PARTITION KEYS-- FAILSSELECT * FROM cars WHERE lot_id = 1;SELECT * FROM cars WHERE make = 'Ford';

-- SUCCESSSELECT * FROM cars WHERE lot_id = 1 AND make = 'Ford';

Generally, Cassandra will store columns having the same lot id but a different make on different nodes, and columns having the same lot id

and make on the same node.

INDEXES

You can also add indices at any time:CREATE INDEX genre_idx ON movie (genre);

Or even on a collection field (Cassandra 2.1+)CREATE INDEX cast_idx ON movie (cast);

But for maps, you can index either the keys or the values.CREATE INDEX ON users (general);

CREATE INDEX ON users (KEYS(general));

RULES FOR INDEXES

Similar to relational databases, the more unique values in the index, the larger it'll be, and the longer it'll take to read it.

RULES FOR INDEXES

▸ Do not index counter columns. ▸ Do not index high cardinality columns.

▸ Do not index on a frequently updated or deleted column (tombstone issues)

▸ Do not index on a largely partitioned field (which requires communicating with more servers to retrieve the information)

SIDEBAR ON TOMBSTONES

Tombstones are relics from deleted values, used in data replication.▸ Grace period for garbage collection

▸ Avoid nulls

FILTERING WITH INDEX

Indexes on a basic data column filters the same as a primary key.

For collections you will use CONTAINS:SELECT * FROM users WHERE email_addresses CONTAINS '[email protected]';SELECT * FROM users WHERE user_attr_map CONTAINS 'Software Engineer';SELECT * FROM users WHERE user_attr_map CONTAINS KEY 'Job Title';

RANGE QUERIES

Especially useful for timeseries tables, you can select a range of valuesSELECT * FROM weather_reports WHERE report_time >= '2014-05-17 00:00:00-0000' AND report_time < '2014-05-18 00:00:00-0000';

DATA MODEL WORKSHOP

REFERENCING Netflix

USER QUEUE

RELATIONAL MODEL

▸ User table with auto-inc ID▸ Movie table with auto-inc ID

▸ User_Movie table with auto-inc ID, foreign keys user_id and movie_id, plus a timestamp

RELATIONAL MODEL

▸ User table with auto-inc ID▸ Movie table with auto-inc ID

▸ User_Movie table with auto-inc ID, foreign keys user_id and movie_id, plus a timestamp

Completely inefficient for Cassandra

BAD PARTITIONINGCREATE TABLE user_queue ( id uuid, user_id uuid, video_id uuid, added timestamp, PRIMARY KEY (id));CREATE INDEX user_id_idx ON user_queue (user_id);

Terrible.

BETTER PARTITIONING

Including the video info as a JSON blobCREATE TABLE user_queue ( user_id uuid, video_id uuid, video_info text, added timestamp, PRIMARY KEY (user_id));

BETTER MODEL

Utilize the columnsCREATE TABLE user_queue ( user_id uuid, queue map<text, text>, PRIMARY KEY (user_id));

▸ queue map will contain 'movieId' -> json blob of movie▸ Read will pull all movies for a user

ADDING TO QUEUEUPDATE TABLE user SET queue['newMovieId'] = 'json about movie'WHERE user_id = X;

VIEW HISTORY

USER REVIEWS

NEW RELEASES

METRICS

THANK YOU

cassandra data modeling

Technology