Cassandra for impatients

Carlos Alonso (@calonso)

MADRID · NOV 27-28 · 2015

• Introductions

• Cassandra Core concepts

• CQL

• Data modelling and examples

CREATE TABLE users ( id INT PRIMARY KEY, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP);

INSERT INTO users (id, name, surname, birthdate) VALUES (1, 'Carlos', 'Alonso', '1985-03-19');

SELECT * FROM users WHERE id = 1;

ALTER TABLE users ADD city VARCHAR;

UPDATE users SET city = 'London' WHERE id = 1;

IMPATIENT LEVEL…

Aha!! That’s SQL, I already know all this stuff. I’ll start using Cassandra as soon as I get home.

IMPATIENT LEVEL…

Why is this taking so long? Have they gone to kill the cow for my Big Mac?

IMPATIENT LEVEL…

As soon as I see the loading spinner…

CARLOS ALONSO

• Spanish Londoner

• MSc Salamanca University, Spain

• Software Engineer @MyDrive Solutions

• Cassandra certified developer

• Cassandra MVP 2015

• @calonso / http://mrcalonso.com

MYDRIVE SOLUTIONS

• World leading driver profiling company

• Using technology and data to understand how to improve driving behaviour

• Recently acquired by the Generali Group

• @_MyDrive / http://mydrivesolutions.com

• We are hiring!!

CASSANDRA

• Open Source distributed database management system

• Initially developed at Facebook

• Inspired by Amazon’s Dynamo and Google BigTable papers

• Became an Apache top-level project in February 2010

• Nowadays developed by the Apache community, with DataStax as a major contributor

Cassandra Core Concepts

Technical introduction to Apache Cassandra

THE CAP THEOREM

CASSANDRA

• Fast Distributed NoSQL Database

• High Availability

• Linear Scalability => Predictability

• No SPOF

• Multi-DC

• Horizontally scalable => $$$

• Not a drop-in replacement for RDBMS

CLUSTER COMPONENTS

CONSISTENT HASHING

Hash function: “Carlos” => 185664

[Figure: token ring; example token values 1773456738847666528349 and -894763734895827651234]
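The idea on the consistent hashing slide can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: MD5 stands in for the real partitioner hash, and the node names and token values are made up for illustration. Each node owns the range of tokens up to and including its own token, and lookups wrap around the end of the ring.

```python
import bisect
import hashlib


def token(key):
    """Map a partition key to a signed 64-bit token (MD5 is a stand-in hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Toy token ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens):
        # Sort (token, node) pairs so the ring can be searched with bisect.
        self._ring = sorted((t, n) for n, t in node_tokens.items())
        self._tokens = [t for t, _ in self._ring]

    def owner_of_token(self, t):
        # First node whose token is >= t; wrap to the first node past the end.
        i = bisect.bisect_left(self._tokens, t)
        return self._ring[i % len(self._ring)][1]

    def owner(self, key):
        return self.owner_of_token(token(key))


# Hypothetical three-node cluster with evenly spread tokens.
ring = TokenRing({"node-a": -(2**62), "node-b": 0, "node-c": 2**62})
print(ring.owner("Carlos"))
```

Because the owner depends only on the key's hash and the sorted token list, any node (or the driver) can work out where a key lives without asking a central directory.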

REQUEST COORDINATION

• Coordinator: The node designated to coordinate a particular query.

• ANY node can coordinate ANY request.

• No SPOF: one of Cassandra’s main principles.

• The driver chooses which node will coordinate

REPLICATION FACTOR

How many copies (replicas) of your data are kept

WHY REPLICATION?

• Disaster recovery

• Bring data closer to users (to reduce latencies)

• Workload segregation (analytical vs transactional)

REPLICATION

Defined at keyspace level

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 2 };

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'dc-east': 2, 'dc-west': 3 };

CONSISTENCY LEVEL

How many replicas of your data must respond ok?

[Figure: the consistency level as a trade-off between Availability / Partition tolerance and Consistency]

[Figure: Client => Driver => Partitioner; f81d4fae-… => 834]

• RF = 3

• CL = QUORUM

• SELECT * … WHERE id = f81d4fae-…
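The arithmetic behind the RF = 3, CL = QUORUM example can be sketched as follows. The function names are illustrative, not part of any driver API. A quorum is a majority of the replicas, and reads are guaranteed to see the latest acknowledged write when the replicas contacted on read plus those acknowledged on write exceed the replication factor.

```python
def quorum(replication_factor):
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return write_replicas + read_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                                  # 2
print(read_sees_latest_write(rf, q, q))   # True:  QUORUM writes + QUORUM reads
print(read_sees_latest_write(rf, 1, 1))   # False: ONE writes + ONE reads
```

With RF = 3, a QUORUM request still succeeds when one replica is down, which is the availability side of the trade-off.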

COMPACTIONS

• Merges most recent partition keys and columns

• Evicts tombstoned columns

• Creates new SSTable

• Rebuilds partition indexes and summaries

• Deletes old SSTables
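The merge step listed above can be illustrated with a toy model. This is a simplification under stated assumptions: each SSTable is represented as a plain dict mapping a key to a (timestamp, value) pair, a value of None stands for a tombstone, and tombstones are dropped immediately, whereas Cassandra only evicts them after a grace period.

```python
TOMBSTONE = None  # a deleted cell is modelled as a value of None


def compact(sstables):
    """Merge SSTables into one, keeping the newest cell per key.

    Each SSTable is a dict: key -> (timestamp, value).
    Tombstoned keys are left out of the result.
    """
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            # Last write wins: keep the cell with the highest timestamp.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # The new SSTable is sorted by key and contains no tombstones.
    return {
        key: cell
        for key, cell in sorted(merged.items())
        if cell[1] is not TOMBSTONE
    }


older = {"a": (1, "x"), "b": (1, "y"), "c": (1, "w")}
newer = {"a": (2, "z"), "b": (3, TOMBSTONE)}
print(compact([older, newer]))  # {'a': (2, 'z'), 'c': (1, 'w')}
```

Once the merged table is written, the input tables are no longer needed, which corresponds to the "Deletes old SSTables" step.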

COMPACTIONS

• SizeTieredCompactionStrategy

• LeveledCompactionStrategy

• DateTieredCompactionStrategy

CREATE TABLE user ( id UUID PRIMARY KEY, email TEXT, name TEXT, preferences SET<TEXT>) WITH COMPACTION = { 'class': '<strategy>', <params> };

CQL

Cassandra Query Language

CREATE TABLE users ( id UUID, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP, PRIMARY KEY(id));

Familiar row-column SQL-like approach.

INSERT INTO users (id, name, surname, birthdate) VALUES (uuid(), 'Carlos', 'Alonso', '1985-03-19');

SELECT * FROM users WHERE id = f81d4fae-7dec-11d0-a765-00a0c91e6bf6;

ALTER TABLE users ADD address VARCHAR;

WHERE

• Equality matches:

• SELECT … WHERE album_title = 'Revolver';

• IN:

• Only applicable to the last column in the WHERE clause

• SELECT … WHERE album_title = 'Revolver' AND number IN (2, 3, 4);

• Range search:

• Only on clustering columns.

• SELECT … WHERE album_title = 'Revolver' AND number >= 2 AND number < 6;

COLLECTIONS

• Set: Uniqueness

• List: Order

• Map: Key-Value pairs

• Tuple: Fixed length

email_addresses SET<VARCHAR>

email_addresses LIST<VARCHAR>

email_addresses MAP<VARCHAR, VARCHAR>

email_addresses TUPLE<VARCHAR, VARCHAR>

MORE FEATURES

• COUNTERS

• LWT (lightweight transactions)

• TTL (time to live)

• UDT (user-defined types)

• BATCHES

• BATCH + LWT

• STATIC COLUMNS

Data Modelling

Basic concepts for a Cassandra model

PHYSICAL DATA STRUCTURE

DATA MODELLING

• Understand your data

• Decide how you’ll query the data

• Define column families to satisfy those queries

• Implement and optimize

DATA MODELLING

[Figure: Conceptual Model => (Query-Driven Methodology) => Logical Model => (Analysis & Validation) => Physical Model]

PRIMARY KEY

PARTITION KEY + CLUSTERING COLUMN(S)

QUERY DRIVEN METHODOLOGY

• Spread data evenly around the cluster

• Minimise the number of partitions read

• Follow the mapping rules:

• Entities and relationships: map to tables

• Equality search attributes: must be at the beginning of the primary key

• Inequality search attributes: become clustering columns

• Ordering attributes: become clustering columns

• Key attributes: map to primary key columns

ANALYSIS & VALIDATION

• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• How much data duplication? (batches)

• Client side joins or new table?
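The partition size check from this slide is plain arithmetic and can be sketched as a small helper. The function names are illustrative; the formula and the 1M cell limit are the ones stated above.

```python
def cells_per_partition(n_rows, n_cols, n_pk, n_static=0):
    """Ncells = Nrow x (Ncols - Npk - Nstatic) + Nstatic."""
    return n_rows * (n_cols - n_pk - n_static) + n_static


def max_rows_per_partition(n_cols, n_pk, n_static=0, limit=1_000_000):
    """Largest row count that keeps a partition under the cell limit."""
    return (limit - n_static) // (n_cols - n_pk - n_static)


# One row, 5 columns, 1 primary key column, no static columns.
print(cells_per_partition(1, 5, 1))   # 4

# 7 columns with 1 key column counted, then 5 columns with 2.
print(max_rows_per_partition(7, 1))   # 166666
print(max_rows_per_partition(5, 2))   # 333333
```

The same three inputs are used in the examples that follow in this deck, so the helper can be used to re-check their estimates.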

An E-Library project.

EXAMPLE 1

Books can be uniquely identified and accessed by ISBN; we also need a title, genre, author and publisher.

QDM

Q1: Find books by ISBN

Book: ISBN (K), Title, Author, Genre, Publisher

• 1 X (5 - 1 - 0) + 0 < 1M

CREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR);

SELECT * FROM books WHERE ISBN = '…';

Users register into the system uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or internal unique ID.

QDM

Q1: Find users by ID

Users_by_ID: ID (K), full_name

Q2: Find users by login info

Users_by_login_info: email (K), password (K), full_name, ID

• 1 X (2 - 1 - 0) + 0 < 1M

CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR);

SELECT * FROM users_by_id WHERE ID = …;

Users_by_login_info

• 1 X (4 - 2 - 0) + 0 < 1M

• Client side joins or new table? N/A

CREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY ((email, password)));

SELECT * FROM users_by_login_info WHERE email = '…' AND password = '…';


BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…); APPLY BATCH;

Users read books. We want to know which books a user has read.

Q1: Find all books a logged-in user has read

Books_read_by_user: user_ID (K), title (C), author (C), full_name, ISBN, genre, publisher

• Books X (7 - 1 - 0) + 0 < 1M => 166,666 books per user

• Client side joins or new table? New table

CREATE TABLE books_read_by_user ( user_ID TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_ID, title, author));

SELECT * FROM books_read_by_user WHERE user_ID = …;


In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site.

Q1: Find all actions a user does in a time range

Actions_by_user: user_ID (K), 2_weeks (K), time (C), element, type

• Actions X (5 - 2 - 0) + 0 < 1M => 333,333 per user every 2 weeks

CREATE TABLE actions_by_user ( user_ID TIMEUUID, "2_weeks" INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, "2_weeks"), time));

SELECT * FROM actions_by_user WHERE user_ID = … AND "2_weeks" = … AND time < … AND time > …;
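The slides do not say how the 2_weeks bucket value is computed, so the following is only one possible scheme, stated as an assumption: count whole 14-day periods since the Unix epoch. The helper names are illustrative. A time range query then needs one SELECT per bucket that the range touches.

```python
from datetime import datetime, timezone

# Assumption: buckets are consecutive 14-day windows starting at the Unix epoch.
BUCKET_SECONDS = 14 * 24 * 60 * 60


def two_week_bucket(moment):
    """Bucket number for a timezone-aware datetime."""
    return int(moment.timestamp()) // BUCKET_SECONDS


def buckets_for_range(start, end):
    """All buckets a [start, end] time range touches, one query per bucket."""
    return list(range(two_week_bucket(start), two_week_bucket(end) + 1))


start = datetime(2015, 11, 1, tzinfo=timezone.utc)
end = datetime(2015, 11, 28, tzinfo=timezone.utc)
for bucket in buckets_for_range(start, end):
    print(bucket)  # value to bind to the 2_weeks column in each SELECT
```

Bucketing by time keeps each partition bounded, at the cost of issuing several queries when a range spans more than one bucket.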


THANKS!
