cassandra20141009

Details And Data Modeling

Agenda

Quick Review Of Cassandra

New Developments In Cassandra

Basic Data Modeling Concepts

Materialized Views

Secondary Indexes

Counters

Time Series Data

Expiring Data

2

Cassandra High Level

Cassandra's architecture is based on the combination of two technologies

Google BigTable – Data Model

Amazon Dynamo – Distributed Architecture

Cassandra = C*

3

Architecture Basics & Terminology

Nodes are single instances of C*

Cluster is a group of nodes

Data is organized by keys (tokens), which are distributed across the cluster

Replication Factor (RF) determines how many copies of each key are kept

Data Center Aware

Consistency Level – powerful feature to tune consistency vs speed vs availability.

4

C* Ring

5

More Architecture

Information on who has what data and who is available is transferred using gossip.

No single point of failure (SPOF); every node can service requests.

Data Center Aware

6

CAP Theorem

Distributed Systems Law:

Consistency

Availability

Partition Tolerance

(you can only really have two in a distributed system)

Cassandra is AP with Eventual Consistency

7

Consistency

Cassandra uses the concept of Tunable Consistency, which makes it very powerful and flexible for system needs.
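As a rough illustration of tunable consistency, the cqlsh session below switches the consistency level between requests. The keyspace demo (assumed to have RF = 3), the profiles table, and the email value are hypothetical names chosen for this sketch. With RF = 3, QUORUM means 2 replicas, so a QUORUM write followed by a QUORUM read overlaps on at least one replica (2 + 2 > 3), while ONE trades that guarantee for lower latency.

```cql
-- Assumption: keyspace "demo" already exists with RF = 3 (hypothetical).
CREATE TABLE IF NOT EXISTS demo.profiles (
    emailaddress text PRIMARY KEY,
    firstname text
);

-- CONSISTENCY is a cqlsh shell command, not a CQL statement.
CONSISTENCY QUORUM;

-- Both requests now wait for 2 of the 3 replicas.
INSERT INTO demo.profiles (emailaddress, firstname)
    VALUES ('user@example.com', 'tim');
SELECT firstname FROM demo.profiles WHERE emailaddress = 'user@example.com';

-- Faster and more available, but a read may return stale data.
CONSISTENCY ONE;
SELECT firstname FROM demo.profiles WHERE emailaddress = 'user@example.com';
```

Client drivers usually expose the same setting per statement rather than per session.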

8

C* Persistence Model

9

Read Path

10

Write Path

11

Data Model Architecture

Keyspace – container of column families (tables). Defines the RF, among other settings.

Table – column family. Contains the schema definition.

Row – a “record” identified by a key

Column – a key and a value
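A minimal sketch of how a keyspace carries the replication factor. The keyspace names and the data center names dc1 and dc2 are assumptions for illustration; real data center names must match what the cluster's snitch reports.

```cql
-- Multi data center keyspace: 3 replicas in dc1, 2 in dc2 (hypothetical names).
CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};

-- Single-node development keyspace: one copy of each key.
CREATE KEYSPACE IF NOT EXISTS demo_dev
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Tables created after this statement live in the demo keyspace.
USE demo;
```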

12

Keys

Primary Key

Partition Key – identifies a row (partition)

Clustering Key – sorting within a row

Using CQL these are defined together as a compound (composite) key

Compound keys are how you implement “wide rows”, which we will look at a lot!

14

Single Primary Key

create table users (
    user_id UUID PRIMARY KEY,
    firstname text,
    lastname text,
    emailaddress text
);

** Cassandra Data Types: http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/cql_data_types_c.html

15


Compound Key

create table users (
    emailaddress text,
    department text,
    firstname text,
    lastname text,
    PRIMARY KEY (emailaddress, department)
);

Partition Key plus Clustering Key

emailaddress is the partition key

department is the clustering key

16

Compound Key

create table users (
    emailaddress text,
    department text,
    country text,
    firstname text,
    lastname text,
    PRIMARY KEY ((emailaddress, department), country)
);

Partition Key plus Clustering Key

emailaddress & department together are the partition key

country is the clustering key

17

Deletions

Distributed systems present a unique problem for deletes. If the data were physically removed while a node was down and missed the delete notice, that node would try to recreate the record when it came back online. So…

Tombstone – the data is replaced with a special value called a tombstone, which works within the distributed architecture
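The tombstone behavior can be sketched with an ordinary delete. The table name and values are hypothetical; gc_grace_seconds is the table option that controls how long tombstones are kept before compaction may purge them, and 864000 seconds (10 days) is its default.

```cql
-- Hypothetical table in an assumed keyspace "demo".
CREATE TABLE IF NOT EXISTS demo.accounts (
    emailaddress text PRIMARY KEY,
    firstname text
);

INSERT INTO demo.accounts (emailaddress, firstname)
    VALUES ('user@example.com', 'tom');

-- Writes a tombstone rather than removing the data in place.
DELETE FROM demo.accounts WHERE emailaddress = 'user@example.com';

-- Tombstones must outlive any node outage so the delete can still reach
-- the replica that missed it; this restates the default of 10 days.
ALTER TABLE demo.accounts WITH gc_grace_seconds = 864000;
```

A node that stays down longer than gc_grace_seconds should be repaired or rebuilt before rejoining, otherwise deleted data can reappear.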

18

New Rules

Writes Are Cheap

Denormalize All You Need

Model Your Queries, Not Data

(understand access patterns)

Application Worries About Joins

19

What’s New In 2.0

Conditional DDL

IF EXISTS or IF NOT EXISTS

Drop Column Support

ALTER TABLE users DROP lastname;
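A short sketch of the conditional DDL forms, which make schema scripts safe to re-run. The keyspace and table names are hypothetical.

```cql
-- Each statement is a no-op instead of an error when the condition fails.
CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS demo.staff (
    emailaddress text PRIMARY KEY,
    firstname text,
    lastname text
);

-- Drop column support: removes lastname from the schema.
ALTER TABLE demo.staff DROP lastname;

DROP TABLE IF EXISTS demo.staff_old;
```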

20

More New Stuff

Triggers

CREATE TRIGGER myTrigger
    ON myTable
    USING 'com.thejavaexperts.cassandra.updateevt';

Lightweight Transactions (CAS)

UPDATE users
    SET firstname = 'tim'
    WHERE emailaddress = '[email protected]'
    IF firstname = 'tom';

** Not like an ACID Transaction!!

21

CAS & Transactions

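CAS – compare-and-set operations. A single, atomic operation compares the value of a column in the database and applies a modification depending on the result of the comparison.

Consider the performance hit: lightweight transactions run a Paxos round among the replicas, so they cost several more round trips than a plain write. CAS is (was) considered an anti-pattern.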
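A minimal sketch of the two common compare-and-set forms. The table and values are hypothetical. In cqlsh, each conditional statement returns a row with an [applied] column that reports whether the condition held.

```cql
-- Hypothetical table in an assumed keyspace "demo".
CREATE TABLE IF NOT EXISTS demo.logins (
    emailaddress text PRIMARY KEY,
    firstname text
);

-- Insert only if no row with this key exists yet (e.g. claiming a username).
INSERT INTO demo.logins (emailaddress, firstname)
    VALUES ('user@example.com', 'tom')
    IF NOT EXISTS;

-- Update only if the current value matches the expected one.
UPDATE demo.logins
    SET firstname = 'tim'
    WHERE emailaddress = 'user@example.com'
    IF firstname = 'tom';
```

Because of the extra coordination, these are best reserved for the few operations that really need a read-before-write guarantee.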

22

Data Modeling… The Basics

Cassandra is now very familiar to RDBMS/SQL users.

Very nicely hides the underlying data storage model.

Still have all the power of Cassandra; it is all in the key definition.

RDBMS = model data

Cassandra = model access (queries)

23

Side-Note On Querying

Create table with compound key

Select using ALLOW FILTERING

Counts

Select using IN or =
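The demo steps above could look roughly like the following. The table, index name, and values are hypothetical. Exactly which queries require ALLOW FILTERING differs between Cassandra versions, so this sketch uses the long-standing case of an indexed column combined with a second, non-indexed restriction.

```cql
-- Hypothetical table with a compound key, in an assumed keyspace "demo".
CREATE TABLE IF NOT EXISTS demo.people_by_dept (
    department text,
    emailaddress text,
    country text,
    firstname text,
    PRIMARY KEY (department, emailaddress)
);
CREATE INDEX IF NOT EXISTS people_country_idx ON demo.people_by_dept (country);

-- Select using = on the partition key, optionally narrowing by clustering key.
SELECT * FROM demo.people_by_dept WHERE department = 'eng';
SELECT * FROM demo.people_by_dept
    WHERE department = 'eng' AND emailaddress = 'user@example.com';

-- Select using IN on the partition key.
SELECT * FROM demo.people_by_dept WHERE department IN ('eng', 'sales');

-- Counts.
SELECT COUNT(*) FROM demo.people_by_dept WHERE department = 'eng';

-- Rejected without ALLOW FILTERING: rows matching the index are read
-- and then filtered on firstname, which can be expensive.
SELECT * FROM demo.people_by_dept
    WHERE country = 'USA' AND firstname = 'jim'
    ALLOW FILTERING;
```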

24

Batch Operations

Saves Network Roundtrips

Can contain INSERT, UPDATE, DELETE

Atomic by default (all or nothing)

Can use timestamp for specific ordering

25

Batch Operation Example

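BEGIN BATCH
    INSERT INTO users (emailaddress, firstname, lastname, country)
        values ('[email protected]', 'brian', 'enochson', 'USA');
    INSERT INTO users (emailaddress, firstname, lastname, country)
        values ('[email protected]', 'tom', 'peters', 'DE');
    INSERT INTO users (emailaddress, firstname, lastname, country)
        values ('[email protected]', 'jim', 'smith', 'USA');
    INSERT INTO users (emailaddress, firstname, lastname, country)
        values ('[email protected]', 'alan', 'rogers', 'USA');
    DELETE FROM users WHERE emailaddress = '[email protected]';
APPLY BATCH;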

select in cqlsh

List in cassandra-cli with timestamp

26

More Data Modeling…

No Joins

No Foreign Keys

No Third (or any other) Normal Form Concerns

Redundant Data Encouraged. Apps maintain consistency.

27

Secondary Indexes

Allow defining indexes to support access by columns other than the partition key.

Each node has a local index for its data.

They have uses, but shouldn’t be used all the time without consideration.

We will look at alternatives.

28

Secondary Index Example

Create a table

Try to select with column not in PK

Add Secondary Index

Try select again.
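The four demo steps might be sketched as follows. The table and index names are hypothetical, and the exact error text for the rejected query depends on the Cassandra version.

```cql
-- 1. Create a table (hypothetical, in an assumed keyspace "demo").
CREATE TABLE IF NOT EXISTS demo.employees (
    emailaddress text PRIMARY KEY,
    firstname text,
    lastname text,
    country text
);

-- 2. Selecting by a column outside the primary key is rejected:
-- SELECT * FROM demo.employees WHERE country = 'USA';

-- 3. Add a secondary index on that column.
CREATE INDEX IF NOT EXISTS employees_country_idx ON demo.employees (country);

-- 4. The same select is now accepted; every node consults its local index.
SELECT * FROM demo.employees WHERE country = 'USA';
```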

29

When to use?

Low Cardinality – small number of unique values

High Cardinality – high number of distinct values

Secondary Indexes are good for Low Cardinality. So country codes, department codes etc. Not email addresses.

30

Materialized View

If you want full distribution, you can use what is called the Materialized View pattern.

Remember redundant data is fine.

Model the queries

31

Materialized View Example

Show a normal table with a compound key and its querying limitations

Create a Materialized View table with a different compound key to support alternate access.

Selects use the partition key. Secondary indexes are local, not distributed

ALLOW FILTERING can cause performance issues
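A minimal sketch of the pattern: the same data is written to two tables with different keys, so each query hits a single partition. All table names and values here are hypothetical, and the application is responsible for writing to both tables.

```cql
-- Base table: lookup by email address.
CREATE TABLE IF NOT EXISTS demo.users_by_email (
    emailaddress text,
    department text,
    firstname text,
    lastname text,
    PRIMARY KEY (emailaddress, department)
);

-- "Materialized view" table: same data keyed for lookup by department.
CREATE TABLE IF NOT EXISTS demo.users_by_department (
    department text,
    emailaddress text,
    firstname text,
    lastname text,
    PRIMARY KEY (department, emailaddress)
);

-- A logged batch keeps the two copies in step.
BEGIN BATCH
    INSERT INTO demo.users_by_email (emailaddress, department, firstname, lastname)
        VALUES ('user@example.com', 'eng', 'jim', 'smith');
    INSERT INTO demo.users_by_department (department, emailaddress, firstname, lastname)
        VALUES ('eng', 'user@example.com', 'jim', 'smith');
APPLY BATCH;

-- Alternate access path, no index and no ALLOW FILTERING needed.
SELECT * FROM demo.users_by_department WHERE department = 'eng';
```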

32

Counters

Updated in 2.1 and now work in a more distributed and accurate manner.

Table organization, example

How to update, view etc.
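A small sketch of a counter table. The table and column names are hypothetical. In a counter table every column outside the primary key must be a counter, and values are changed with UPDATE rather than INSERT.

```cql
-- Hypothetical table in an assumed keyspace "demo".
CREATE TABLE IF NOT EXISTS demo.page_views (
    page text PRIMARY KEY,
    views counter
);

-- Increment and decrement; the row is created on first update.
UPDATE demo.page_views SET views = views + 1 WHERE page = '/home';
UPDATE demo.page_views SET views = views + 5 WHERE page = '/home';
UPDATE demo.page_views SET views = views - 1 WHERE page = '/home';

-- View the current value.
SELECT page, views FROM demo.page_views WHERE page = '/home';
```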

33

Time Series Example….

Time series table model.

Need to consider the interval for event frequency and the wide row size.

Make what is tracked, plus the unit of time interval, the partition key.

34

Time Series Data

Due to its quick writing model, Cassandra is suited for storing time series data.

The Cassandra wide row is a perfect fit for modeling time series / time based events.

Let’s look at an example….

35

Event Data

Notice the partition key and clustering key.

Insert some data

View in CQL, then in CLI as wide row
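One possible shape for such an event table, assuming a hypothetical sensor feed bucketed by day. The tracked item plus the day form the partition key, which bounds the wide row size, and the event time is the clustering key, which keeps events sorted within the row.

```cql
-- Hypothetical table in an assumed keyspace "demo".
CREATE TABLE IF NOT EXISTS demo.sensor_events (
    sensor_id text,
    day text,
    event_time timestamp,
    reading double,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Insert some data.
INSERT INTO demo.sensor_events (sensor_id, day, event_time, reading)
    VALUES ('s1', '2014-10-09', '2014-10-09 12:00:00+0000', 21.5);
INSERT INTO demo.sensor_events (sensor_id, day, event_time, reading)
    VALUES ('s1', '2014-10-09', '2014-10-09 12:01:00+0000', 21.7);

-- Latest readings for one sensor and day, newest first.
SELECT event_time, reading FROM demo.sensor_events
    WHERE sensor_id = 's1' AND day = '2014-10-09'
      AND event_time >= '2014-10-09 12:00:00+0000'
    LIMIT 10;
```

A busier source would need a smaller bucket (hour instead of day) to keep each partition at a manageable size.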

36

TTL – Self Expiring Data

Another technique is storing data that has a defined lifespan.

For instance, session identifiers, temporary passwords, etc.

For this, Cassandra provides a Time To Live (TTL) mechanism.

37

TTL Example…

Create table

Insert data using TTL

Can update a specific column with a TTL

Show using selects.
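These steps might look like the sketch below. The table name, UUID, and values are hypothetical; TTL values are in seconds.

```cql
-- Create table (hypothetical, in an assumed keyspace "demo").
CREATE TABLE IF NOT EXISTS demo.user_sessions (
    session_id uuid PRIMARY KEY,
    emailaddress text,
    temp_password text
);

-- Insert data using TTL: the whole row expires after one hour.
INSERT INTO demo.user_sessions (session_id, emailaddress, temp_password)
    VALUES (123e4567-e89b-12d3-a456-426655440000, 'user@example.com', 's3cret')
    USING TTL 3600;

-- Give a single column its own, shorter lifespan of five minutes.
UPDATE demo.user_sessions USING TTL 300
    SET temp_password = 'n3wpass'
    WHERE session_id = 123e4567-e89b-12d3-a456-426655440000;

-- Show the remaining seconds to live for that column.
SELECT temp_password, TTL(temp_password) FROM demo.user_sessions
    WHERE session_id = 123e4567-e89b-12d3-a456-426655440000;
```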

38

Questions

Email: [email protected]

Twitter: @benochso

G+: https://plus.google.com/+BrianEnochson

39

