cassandra scalable data modelling by example
TRANSCRIPT
![Page 1: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/1.jpg)
Scalable data modelling by example
Carlos Alonso (@calonso)
![Page 2: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/2.jpg)
Carlos Alonso
• Ex-Spanish Londoner
• MSc Salamanca University, Spain
• Software Engineer @MyDrive Solutions
• Cassandra certified developer
• Cassandra MVP 2015
• @calonso / http://mrcalonso.com
![Page 3: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/3.jpg)
MyDrive Solutions
❖ World leading driver profiling company
❖ Using technology and data to understand how
to improve driving behaviour
❖ Recently acquired by the Generali Group
❖ @_MyDrive / http://mydrivesolutions.com
❖ We are hiring!!
![Page 4: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/4.jpg)
Cassandra• Open Source distributed database
management system
• Initially developed at Facebook
• Inspired by Amazon’s Dynamo and Google BigTable papers
• Became Apache top-level project in Feb, 2010
• Nowadays developed by DataStax
![Page 5: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/5.jpg)
The data model is the only thing you can’t change once in production.
![Page 6: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/6.jpg)
Cassandra Concepts
![Page 7: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/7.jpg)
Cassandra• Fast Distributed NoSQL Database
• High Availability
• Linear Scalability => Predictability
• No SPOF
• Multi-DC
• Horizontally scalable => $$$
• Not a drop in replacement for RDBMS
![Page 8: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/8.jpg)
The CAP Theorem
![Page 9: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/9.jpg)
Cluster Components
![Page 10: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/10.jpg)
Cluster Components
![Page 11: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/11.jpg)
Consistent Hashing
Hash function
“Carlos” 185664
1773456738847666528349
-894763734895827651234
![Page 12: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/12.jpg)
Replication factor
❖ How many copies (replicas) for your data
![Page 13: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/13.jpg)
Consistency LevelHow many replicas of your data
must respond ok?
![Page 14: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/14.jpg)
Availability /Partition tolerance Consistency
![Page 15: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/15.jpg)
Request Coordination
❖ Coordinator: The node designed to coordinate a particular query
❖ ANY node can coordinate ANY request.
❖ No SPOF: One of Cassandra’s ground principles
❖ The driver chooses which node will coordinate.
![Page 16: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/16.jpg)
A complete example
DriverClient
Partitioner
f81d4fae-…834
• RF = 3• CL = QUORUM• SELECT * … WHERE id = f81d4fae-…
![Page 17: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/17.jpg)
![Page 18: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/18.jpg)
Compaction
❖ SizeTieredCompactionStrategy
❖ LeveledCompactionStrategy
❖ DateTieredCompactionStrategy
❖ TimeWindowCompactionStrategy (kudos Jeff Jirsa)
CREATE TABLE user ( id UUID PRIMARY KEY, email TEXT, name TEXT, preferences SET<TEXT>, ) WITH COMPACTION = { “class”: “<strategy>”, <params> };
![Page 19: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/19.jpg)
![Page 20: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/20.jpg)
Physical Data Layout
![Page 21: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/21.jpg)
Data Modelling
![Page 22: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/22.jpg)
The Primary Key
PARTITION KEY + CLUSTERING
COLUMN(S)
![Page 23: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/23.jpg)
CQL WHERE• Full Primary Key has to be specified.
• Equality matches (only in primary keys or clustering columns):
• SELECT … WHERE album_title = ‘Revolver’;
• IN (only on the last clause):
• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);
• Range search (only on clustering columns):
• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;
![Page 24: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/24.jpg)
Data Modelling
❖ Understand your data
❖ Decide (know) how you’ll query the data
❖ Define column families to satisfy those queries
❖ Implement and optimize
![Page 25: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/25.jpg)
Data Modelling
❖ Understand your data
❖ Decide (know) how you’ll query the data
❖ Define column families to satisfy those
queries
❖ Implement and optimize
Conceptual Model
Logical Model
Physical Model
Query-DrivenMethodology
Analysis &Validation
![Page 26: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/26.jpg)
Query Driven Methodology❖ Spread data evenly around the cluster
❖ Minimise the number of partitions read
❖ Follow the mapping rules:
❖ Entities and relationships: map to tables
❖ Equality search attributes: must be at the beginning of the primary key
❖ Inequality search attributes: become clustering columns
❖ Ordering attributes: become clustering columns
❖ Key attributes: map to primary key columns
![Page 27: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/27.jpg)
Analysis & Validation
❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table?
![Page 28: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/28.jpg)
An E-Library project.
![Page 29: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/29.jpg)
Requirement: 1Books can be uniquely identified and accessed by ISBN, we also need a title, genre, author and publisher.
QDMBook
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
![Page 30: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/30.jpg)
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ 1 X (5 - 1 - 0) + 0 < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table?
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
N/A
N/A
![Page 31: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/31.jpg)
Physical data model
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBNCREATE TABLE books ( ISBN VARCHAR PRIMARY KEY, title VARCHAR, author VARCHAR, genre VARCHAR, publisher VARCHAR );
SELECT * FROM books WHERE ISBN = ‘…’;
![Page 32: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/32.jpg)
Requirement 2Users register into the system uniquely identified by an email and a password. We also want their full name.
They will be accessed by email and password or internal unique ID.
QDM K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
Users_by_login_info
email K
password
Q2, Q3
Q2: Find users by login info
C
full_name
ID
Q3: Find users by email (to guarantee uniqueness)
![Page 33: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/33.jpg)
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ 1 X (2 - 1 - 0) + 0 < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table?
N/A
N/A
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
![Page 34: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/34.jpg)
Physical Data Model
CREATE TABLE users_by_id ( ID TIMEUUID PRIMARY KEY, full_name VARCHAR );
SELECT * FROM users_by_id WHERE ID = …;
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
![Page 35: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/35.jpg)
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ 1 X (4 - 1 - 0) + 0 < 1M
❖ How much data duplication? (batches)
❖ Client side joins or new table? N/A
Q2: Find users by login info
Users_by_login_info
email K
password C
full_name
ID
Q3: Find users by email (to guarantee uniqueness)
![Page 36: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/36.jpg)
Physical Data ModelCREATE TABLE users_by_login_info ( email VARCHAR, password VARCHAR, full_name VARCHAR, ID TIMEUUID, PRIMARY KEY (email, password) );
SELECT * FROM users_by_login_info WHERE email = ‘…’ [AND password = ‘…’];
Q2: Find users by login info
Users_by_login_info
email K
password C
full_name
ID
Q3: Find users by email (to guarantee uniqueness)
![Page 37: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/37.jpg)
Physical Data Model
BEGIN BATCH INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS; INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…); APPLY BATCH;
![Page 38: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/38.jpg)
Requirement 3Users read books.
We want to know which books has a user read.
QDMK
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
Q1: Find all books a loggeduser has read
S
![Page 39: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/39.jpg)
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user
❖ How much data duplication? (batches)
❖ Client side joins or new table?
Q1: Find all books a loggeduser has read
New table
N/A
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
S
![Page 40: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/40.jpg)
Physical Data ModelCREATE TABLE books_read_by_user ( user_id TIMEUUID, title VARCHAR, author VARCHAR, full_name VARCHAR STATIC, ISBN VARCHAR, genre VARCHAR, publisher VARCHAR, PRIMARY KEY (user_id, title, author) );
SELECT * FROM books_read_by_user WHERE user_ID = …;
Q1: Find all books a loggeduser has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
S
![Page 41: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/41.jpg)
Requirement 4In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have
with our site.
QDMuser_ID
2_weeks
time
element
type
K
Actions_by_user
Q1
Q1: Find all actions a user does in a time range
K
C
![Page 42: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/42.jpg)
Analysis & Validation❖ 1 Partition per read?
❖ Are write conflicts (overwrites) possible?
❖ How large are partitions?
❖ Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
❖ Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks
❖ How much data duplication? (batches)
❖ Client side joins or new table?
N/A
K
Actions_by_user
user_ID
2_weeks
Q1
Q1: Find all actions a user does in a time range
K
time
element
type
C
N/A
![Page 43: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/43.jpg)
Physical Data ModelCREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );
SELECT * FROM actions_by_user WHERE user_ID = … AND 2_weeks = … AND time < … AND time > …;
K
Actions_by_user
user_ID
2_weeks
Q1
Q1: Find all actions a user does in a time range
K
time
element
type
C
![Page 44: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/44.jpg)
Further validation
∑sizeOf(pk) + ∑sizeOf(sc) + Nr x ∑(sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv
CREATE TABLE actions_by_user ( user_ID TIMEUUID, 2_weeks INT, time TIMESTAMP, element VARCHAR, type VARCHAR, PRIMARY KEY ((user_ID, 2_weeks), time) );
![Page 45: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/45.jpg)
Next Steps
❖ Test your models against your hardware setup
❖ cassandra-stress
❖ http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)
❖ Monitor everything
❖ DataStax OpsCenter
![Page 46: Cassandra Scalable data modelling by example](https://reader031.vdocuments.us/reader031/viewer/2022012914/587841151a28ab707b8b6787/html5/thumbnails/46.jpg)
Thanks!
Carlos AlonsoSoftware Engineer
@calonso