cassandra for impatients

Post on 13-Jan-2017

925 Views

Category:

Engineering

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Cassandra for impatientsCarlos Alonso (@calonso)MADRID · NOV 27-28 · 2015

MADRID · NOV 27-28 · 2015@calonso

• Introductions

• Cassandra Core concepts

• CQL

• Data modelling and examples

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE users ( id INT PRIMARY KEY, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP);

INSERT INTO users (id, name, surname, birthdate) VALUES (1, ‘Carlos’, ‘Alonso’, ’1985-03-19’);

SELECT * FROM users WHERE id = 1;

ALTER TABLE users ADD city VARCHAR;

UPDATE users SET city = ‘London’ WHERE id = 1;

MADRID · NOV 27-28 · 2015@calonso

IMPATIENT LEVEL…

Aha!! That’s SQL, I already know all this stuff. I’ll start using Cassandra as soon as I get home.

MADRID · NOV 27-28 · 2015@calonso

IMPATIENT LEVEL…

Why is this taking so long? Have they gone killing the cow for my BigMac?

MADRID · NOV 27-28 · 2015@calonso

IMPATIENT LEVEL…As soon as I see the loading spinner…

MADRID · NOV 27-28 · 2015@calonso

CARLOS ALONSO•Spanish Londoner

•MSc Salamanca University, Spain

•Software Engineer @MyDrive Solutions

•Cassandra certified developer

•Cassandra MVP 2015

•@calonso / http://mrcalonso.com

MADRID · NOV 27-28 · 2015@calonso

MYDRIVE SOLUTIONS

• World leading driver profiling company

• Using technology and data to understand how to improve driving behaviour

• Recently acquired by the Generali Group

• @_MyDrive / http://mydrivesolutions.com

• We are hiring!!

MADRID · NOV 27-28 · 2015@calonso

CASSANDRA• Open Source distributed database

management system

• Initially developed at Facebook

• Inspired by Amazon’s Dynamo and Google BigTable papers

• Became Apache top-level project in Feb, 2010

• Nowadays developed by DataStax

Cassandra Core ConceptsTechnical introduction to Apache CassandraMADRID · NOV 27-28 · 2015

MADRID · NOV 27-28 · 2015@calonso

THE CAP THEOREM

MADRID · NOV 27-28 · 2015@calonso

CASSANDRA• Fast Distributed NoSQL Database

• High Availability

• Linear Scalability => Predictability

• No SPOF

• Multi-DC

• Horizontally scalable => $$$

• Not a drop in replacement for RDBMS

MADRID · NOV 27-28 · 2015@calonso

CLUSTER COMPONENTS

MADRID · NOV 27-28 · 2015@calonso

CLUSTER COMPONENTS

MADRID · NOV 27-28 · 2015@calonso

CONSISTENT HASHING

Hash function

“Carlos” 185664

1773456738847666528349

-894763734895827651234

MADRID · NOV 27-28 · 2015@calonso

REQUEST COORDINATION

• Coordinator: The node designated to coordinate a particular query.

• ANY node can coordinate ANY request.

• No SPOF: One of the main Cassandra’s principles.

• The driver chooses which node will coordinate

MADRID · NOV 27-28 · 2015@calonso

REPLICATION FACTOR

How many copies (replicas) for your data

MADRID · NOV 27-28 · 2015@calonso

WHY REPLICATION?

• Disaster recovery

• Bring data closer to users (to reduce latencies)

• Workload segregation (analytical vs transactional)

MADRID · NOV 27-28 · 2015@calonso

REPLICATIONDefined at keyspace level

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { “class”: “SimpleStrategy”, “replication_factor”: 2 };

CREATE KEYSPACE <my_keyspace> WITH REPLICATION = { “class”: “NetworkTopologyStrategy”,

“dc-east”: 2, “dc-west”: 3 };

MADRID · NOV 27-28 · 2015@calonso

CONSISTENCY LEVEL

How many replicas of your data must respond ok?

MADRID · NOV 27-28 · 2015@calonso

CONSISTENCY LEVEL

Availability /Partition tolerance Consistency

MADRID · NOV 27-28 · 2015@calonso

DriverClient

Partitioner

f81d4fae-…834

• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…

MADRID · NOV 27-28 · 2015@calonso

COMPACTIONS• Merges most recent partition keys and columns

• Evicts tombstoned columns

• Creates new SSTable

• Rebuilds partition indexes and summaries

• Deletes old SSTables

MADRID · NOV 27-28 · 2015@calonso

COMPACTIONS

• SizeTieredCompactionStrategy

• LeveledCompactionStrategy

• DateTieredCompactionStrategy

CREATE TABLE user (id UUID PRIMARY KEY,email TEXT,name TEXT,preferences SET<TEXT>,

) WITH COMPACTION = { “class”: “<strategy>”, <params> };

CQLCassandra Query LanguageMADRID · NOV 27-28 · 2015

MADRID · NOV 27-28 · 2015@calonso

CQL

CREATE TABLE users ( id UUID, name VARCHAR, surname VARCHAR, birthdate TIMESTAMP, PRIMARY KEY(id));

Familiar row-column SQL-like approach.

INSERT INTO users (id, name, surname, birthdate) VALUES (uuid(), ‘Carlos’, ‘Alonso’, ’1985-03-19’);

SELECT * FROM users WHERE id = ‘f81d4fae-7dec-11d0-a765-00a0c91e6bf6’;

ALTER TABLE users ADD address VARCHAR;

MADRID · NOV 27-28 · 2015@calonso

WHERE• Equality matches:

• SELECT … WHERE album_title = ‘Revolver’;

• IN:

• Only applicable in the last WHERE clause

• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);

• Range search:

• Only on clustering columns.

• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;

MADRID · NOV 27-28 · 2015@calonso

COLLECTIONS

• Set: Uniqueness

• List: Order

• Map: Key-Value pairs

• Tuple: Fixed length

email_addresses SET<VARCHAR>

email_addresses LIST<VARCHAR>

email_addresses MAP<VARCHAR, VARCHAR>

email_addresses TUPLE<VARCHAR, VARCHAR>

MADRID · NOV 27-28 · 2015@calonso

MORE FEATURES

• COUNTERS

• LWT

• TTL

• UDT

• BATCHES

• BATCH + LWT

• STATIC COLUMNS

Data ModellingBasic concepts for a Cassandra modelMADRID · NOV 27-28 · 2015

MADRID · NOV 27-28 · 2015@calonso

PHYSICAL DATA STRUCTURE

MADRID · NOV 27-28 · 2015@calonso

DATA MODELLING

• Understand your data

• Decide how you’ll query the data

• Define column families to satisfy those queries

• Implement and optimize

MADRID · NOV 27-28 · 2015@calonso

DATA MODELLINGConceptual

Model

Logical Model

Physical Model

Query-DrivenMethodology

Analysis &Validation

MADRID · NOV 27-28 · 2015@calonso

PRIMARY KEY

PARTITION KEY +CLUSTERING COLUMN(S)

MADRID · NOV 27-28 · 2015@calonso

QUERY DRIVEN METHODOLOGY• Spread data evenly around the cluster

• Minimise the number of partitions read

• Follow the mapping rules:

• Entities and relationships: map to tables

• Equality search attributes: must be at the beginning of the primary key

• Inequality search attributes: become clustering columns

• Ordering attributes: become clustering columns

• Key attributes: map to primary key columns

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• How much data duplication? (batches)

• Client side joins or new table?

MADRID · NOV 27-28 · 2015@calonso

An E-Library project.

MADRID · NOV 27-28 · 2015@calonso

EXAMPLE 1Books can be uniquely identified and accessed by ISBN,

we also need a title, genre, author and publisher.

QDMBook

ISBN KTitle

AuthorGenre

Publisher

Q1

Q1: Find books by ISBN

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• 1 X (5 - 1 - 0) + 0 < 1M

• How much data duplication? (batches)

• Client side joins or new table?

BookISBN KTitle

AuthorGenre

Publisher

Q1

Q1: Find books by ISBN

N/A

N/A

MADRID · NOV 27-28 · 2015@calonso

BookISBN KTitle

AuthorGenre

Publisher

Q1

Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR);

SELECT * FROM books WHERE ISBN = ‘…’;

MADRID · NOV 27-28 · 2015@calonso

Users register into the system uniquely identified by an email and a password. We also want their full name. They will be

accessed by email and password or internal unique ID.

QDM KUsers_by_IDID

full_name

Q1

Q1: Find users by ID

Users_by_login_infoemail K

password

Q2

Q2: Find users by login info

K

full_name

ID

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• 1 X (2 - 1 - 0) + 0 < 1M

• How much data duplication? (batches)

• Client side joins or new table?

N/A

N/A

KUsers_by_IDID

full_name

Q1

Q1: Find users by ID

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR);

SELECT * FROM users_by_id WHERE ID = …;

KUsers_by_IDID

full_name

Q1

Q1: Find users by ID

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• 1 X (4 - 2 - 0) + 0 < 1M

• How much data duplication? (batches)

• Client side joins or new table? N/A

Users_by_login_infoemail K

password

Q2

Q2: Find users by login info

K

full_name

ID

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY_KEY ((email, password)));

SELECT * FROM users_by_login_info WHERE email = ‘…’ AND password = ‘…’;

Users_by_login_infoemail K

password

Q2

Q2: Find users by login info

K

full_name

ID

MADRID · NOV 27-28 · 2015@calonso

BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);APPLY BATCH;

MADRID · NOV 27-28 · 2015@calonso

Users read books. We want to know which books has a user read.

QDM

KBooks_read_by_user

user_IDtitle

Q1

full_nameISBNgenre

publisher

authorCC

Q1: Find all books a loggeduser has read

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• Books X (7 - 1 - 0) + 0 < 1M => 166.666 books per user

• How much data duplication? (batches)

• Client side joins or new table?

Q1: Find all books a loggeduser has read

KBooks_read_by_user

user_IDtitle

Q1

full_nameISBNgenre

publisher

authorCC

New table

N/A

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE books_read_by_user ( user_ID TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR PRIMARY_KEY (user_id, title, author));

SELECT * FROM books_read_by_user WHERE user_ID = …;

Q1: Find all books a loggeduser has read

KBooks_read_by_user

user_IDtitle

Q1

full_nameISBNgenre

publisher

authorCC

MADRID · NOV 27-28 · 2015@calonso

In order to improve our site’s usability we need to understand how our users use it by tracking every interaction

they have with our site.

QDM

user_ID2_weeks

timeelement

type

KActions_by_user

Q1

Q1: Find all actions a user does in a time range

KC

MADRID · NOV 27-28 · 2015@calonso

ANALYSIS & VALIDATION• 1 Partition per read?

• Are write conflicts (overwrites) possible?

• How large are partitions?

• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M

• Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks

• How much data duplication? (batches)

• Client side joins or new table?

N/A

KActions_by_user

user_ID2_weeks

Q1

Q1: Find all actions a user does in a time range

Ktime

elementtype

C

N/A

MADRID · NOV 27-28 · 2015@calonso

CREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR PRIMARY_KEY ((user_ID, 2_weeks), time));

SELECT * FROM actions_by_user WHERE user_ID = … AND 2_weeks = … AND

time < … AND time > …;

KActions_by_user

user_ID2_weeks

Q1

Q1: Find all actions a user does in a time range

Ktime

elementtype

C

MADRID · NOV 27-28 · 2015@calonso

THANKS!

top related