cassandra20141113

OVERVIEW AND REAL WORLD APPLICATIONS Cassandra Jersey Shore Tech Meetup Nov 13, 2014

Upload: brian-enochson

Post on 07-Jul-2015


DESCRIPTION

Cassandra presentation at Jersey Shore Tech - first of a two-part "Big Data Workshop"

TRANSCRIPT

Page 1: Cassandra20141113

OVERVIEW AND REAL WORLD APPLICATIONS

Cassandra

Jersey Shore Tech Meetup

Nov 13, 2014

Page 2: Cassandra20141113

You Are Not Here…*** http://njhalloffame.org/


Page 3: Cassandra20141113

Agenda

Some Basic Concepts/Overview

New Developments In Cassandra

Basic Data Modeling Concepts

Materialized Views

Secondary Indexes

Counters

Time Series Data

Expiring Data

Page 4: Cassandra20141113

Cassandra High Level

Cassandra's architecture is based on the combination of two technologies:

Google BigTable – Data Model

Amazon Dynamo – Distributed Architecture

BTW – these mean the same thing ->

Cassandra = C*

Page 5: Cassandra20141113

Architecture Basics & Terminology

Nodes are single instances of C*

Cluster is a group of nodes

Data is organized by keys (tokens) which are distributed across the cluster

Replication Factor (RF) determines how many copies of each key's data are kept

Data Center Aware – works well in multi-DC/EC2 etc.

Consistency Level – powerful feature to tune consistency vs. speed vs. availability.
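
As a minimal sketch of where the replication factor is set: in CQL it is declared per keyspace. The keyspace names and the data center names dc_east and dc_west are hypothetical; with NetworkTopologyStrategy the data center names have to match what the cluster's snitch reports.

```cql
-- Single data center (or a test cluster): keep 3 copies of every partition.
CREATE KEYSPACE demo_simple
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Multi-DC: 3 copies in one data center and 2 in another.
-- dc_east and dc_west are placeholder data center names.
CREATE KEYSPACE demo_multi_dc
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_east': 3, 'dc_west': 2};
```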

Page 6: Cassandra20141113

C* Ring

Page 7: Cassandra20141113

More Architecture

Information on who has what data and who is available is transferred using gossip.

No single point of failure (SPOF); every node can service requests.

Handles Replication and Downed Nodes (within reason)

Page 8: Cassandra20141113

CAP Theorem

Distributed Systems Law:

Consistency

Availability

Partition Tolerance

(you can only really have two in a distributed system)

Cassandra is AP with Eventual Consistency

Page 9: Cassandra20141113

Consistency

Cassandra uses the concept of Tunable Consistency, which makes it very powerful and flexible for system needs.
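
A minimal sketch of tuning consistency per session from cqlsh. CONSISTENCY is a cqlsh shell command, not a CQL statement; client drivers set the level per request instead. The keyspace, table and values are hypothetical, and the arithmetic in the comments assumes a replication factor of 3.

```cql
-- Hypothetical keyspace and table, used only for this illustration.
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo.sessions (
    session_id uuid PRIMARY KEY,
    username text
);

-- cqlsh command: 2 of the 3 replicas (a quorum) must acknowledge each request.
CONSISTENCY QUORUM;
INSERT INTO demo.sessions (session_id, username)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'brian');
SELECT username FROM demo.sessions
  WHERE session_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Trade consistency for latency and availability: one replica is enough.
CONSISTENCY ONE;
SELECT username FROM demo.sessions
  WHERE session_id = 62c36092-82a1-3a00-93d1-46196ee77204;
```

Reading and writing at QUORUM means the replicas read plus the replicas written exceed the replication factor (2 + 2 > 3), so a read always overlaps the latest acknowledged write. At ONE that overlap is not guaranteed, which is the eventual-consistency end of the dial.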

Page 10: Cassandra20141113

C* Persistence Model

Page 11: Cassandra20141113

Read Path

Page 12: Cassandra20141113

Write Path

Page 13: Cassandra20141113

Data Model Architecture

Keyspace – container of column families (tables). Defines RF among others.

Table – column family. Contains definition of schema.

Row – a “record” identified by a key

Column - a key and a value

Page 14: Cassandra20141113


Page 15: Cassandra20141113

Deletions

Distributed systems present a unique problem for deletes. If Cassandra actually removed the data while a node was down, that node would never receive the delete and would try to recreate the record when it came back online. So…

Tombstone – the data is replaced with a special marker called a tombstone, which replicates like any other write and so works within the distributed architecture.
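
A short sketch of what produces tombstones and how long they are kept. The keyspace, table and values are hypothetical. The gc_grace_seconds value shown is Cassandra's default of ten days, and is included only to show where the setting lives.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo.users_by_id (
    user_id uuid PRIMARY KEY,
    firstname text,
    lastname text
);

INSERT INTO demo.users_by_id (user_id, firstname, lastname)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'brian', 'enochson');

-- Deleting one column writes a tombstone for that cell only.
DELETE lastname FROM demo.users_by_id
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Deleting the row writes a tombstone that covers the whole row.
DELETE FROM demo.users_by_id
  WHERE user_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Tombstones are kept at least gc_grace_seconds before compaction may purge
-- them, which gives downed nodes time to come back and learn of the delete.
ALTER TABLE demo.users_by_id WITH gc_grace_seconds = 864000;
```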

Page 16: Cassandra20141113

Keys

Primary Key

Partition Key – identifies a row (the partition) and determines which nodes store it

Cluster Key (clustering column) – sorts data within that row

Using CQL these are defined together as a compound (composite) key

Compound keys are how you implement “wide rows”, the COOL FEATURE!

Page 17: Cassandra20141113

Single Primary Key

    create table users (
        user_id UUID PRIMARY KEY,
        firstname text,
        lastname text,
        emailaddress text
    );

** Cassandra Data Types

http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/cql_data_types_c.html

Page 18: Cassandra20141113

Compound Key

    create table users (
        emailaddress text,
        department text,
        firstname text,
        lastname text,
        PRIMARY KEY (emailaddress, department)
    );

Partition Key plus Cluster Key

emailaddress is the partition key

department is the cluster key

Page 19: Cassandra20141113

Compound Key

    create table users (
        emailaddress text,
        department text,
        country text,
        firstname text,
        lastname text,
        PRIMARY KEY ((emailaddress, department), country)
    );

Partition Key plus Cluster Key

emailaddress and department together form the partition key

country is the cluster key
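
To make the effect of the key definition concrete, here is a sketch of which queries the composite partition key supports. The table mirrors the definition above under a hypothetical name in a hypothetical demo keyspace, and the addresses and values are made up.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo.users_by_email_dept (
    emailaddress text,
    department text,
    country text,
    firstname text,
    lastname text,
    PRIMARY KEY ((emailaddress, department), country)
);

INSERT INTO demo.users_by_email_dept (emailaddress, department, country, firstname, lastname)
  VALUES ('brian@example.com', 'engineering', 'USA', 'brian', 'enochson');
INSERT INTO demo.users_by_email_dept (emailaddress, department, country, firstname, lastname)
  VALUES ('brian@example.com', 'engineering', 'DE', 'brian', 'enochson');

-- Works: the whole partition key is supplied. Rows come back sorted by
-- country, the cluster key.
SELECT country, firstname FROM demo.users_by_email_dept
  WHERE emailaddress = 'brian@example.com' AND department = 'engineering';

-- Works: partition key plus cluster key selects a single CQL row.
SELECT firstname, lastname FROM demo.users_by_email_dept
  WHERE emailaddress = 'brian@example.com' AND department = 'engineering'
    AND country = 'USA';

-- Rejected as written: only part of the composite partition key is given,
-- so Cassandra cannot work out which partition to read.
-- SELECT * FROM demo.users_by_email_dept WHERE emailaddress = 'brian@example.com';
```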

Page 20: Cassandra20141113

New Rules

Writes Are Cheap

Denormalize All You Need

Model Your Queries, Not Data (understand access patterns)

Application Worries About Joins

Page 21: Cassandra20141113

What’s New In 2.0

Conditional DDL

IF EXISTS or IF NOT EXISTS

Drop Column Support

    ALTER TABLE users DROP lastname;
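
A small sketch of the conditional DDL forms, using a hypothetical keyspace and table. Without the IF clauses, the repeated CREATE and the final DROP of a missing table would each raise an error.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo.ddl_demo (
    id uuid PRIMARY KEY,
    firstname text,
    lastname text
);

-- Running the same statement again is a no-op rather than an
-- "already exists" error, which keeps setup scripts re-runnable.
CREATE TABLE IF NOT EXISTS demo.ddl_demo (
    id uuid PRIMARY KEY,
    firstname text,
    lastname text
);

-- Drop a column, then drop the table.
ALTER TABLE demo.ddl_demo DROP lastname;
DROP TABLE IF EXISTS demo.ddl_demo;

-- The table is already gone, so this second drop does nothing.
DROP TABLE IF EXISTS demo.ddl_demo;
```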

Page 22: Cassandra20141113

More New Stuff

Triggers

    CREATE TRIGGER myTrigger
        ON myTable
        USING 'com.thejavaexperts.cassandra.updateevt';

Lightweight Transactions (CAS)

    UPDATE users
        SET firstname = 'tim'
        WHERE emailaddress = '[email protected]'
        IF firstname = 'tom';

** Not like an ACID Transaction!!

Page 23: Cassandra20141113

CAS & Transactions

CAS – compare-and-set: a single atomic operation compares the value of a column in the database and applies a modification depending on the result of the comparison.

Consider the performance hit. CAS is (was) considered an anti-pattern.
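
A sketch of the two common compare-and-set forms, on a hypothetical table with made-up values. Each conditional statement returns a result row with an [applied] column saying whether the condition held.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo.accounts (
    emailaddress text PRIMARY KEY,
    firstname text
);

-- Insert only if no row with this key exists yet.
INSERT INTO demo.accounts (emailaddress, firstname)
  VALUES ('tom@example.com', 'tom')
  IF NOT EXISTS;

-- The same insert again is not applied: [applied] is False and the
-- existing row is returned alongside it.
INSERT INTO demo.accounts (emailaddress, firstname)
  VALUES ('tom@example.com', 'thomas')
  IF NOT EXISTS;

-- Update only if the current value matches the expected one.
UPDATE demo.accounts
  SET firstname = 'tim'
  WHERE emailaddress = 'tom@example.com'
  IF firstname = 'tom';
```

The performance hit comes from the replicas having to agree on the outcome through Paxos, which costs extra round trips compared with a plain write, so this is best reserved for the few operations that really need it.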

Page 24: Cassandra20141113

Data Modeling… The Basics

With CQL, Cassandra now looks very familiar to RDBMS/SQL users.

CQL very nicely hides the underlying data storage model.

You still have all the power of Cassandra; it is all in the key definition.

RDBMS = model data

Cassandra = model access (queries)

Page 25: Cassandra20141113

Side-Note On Querying

Create table with compound key

Select using ALLOW FILTERING

Counts

Select using IN or =
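
The steps above could look roughly like this. The table and values are hypothetical, and the exact rules for which predicates can be filtered vary between Cassandra versions, so treat the last query as an illustration, not a guarantee.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Table with a compound key: emailaddress is the partition key,
-- department is the cluster key.
CREATE TABLE IF NOT EXISTS demo.users_by_email (
    emailaddress text,
    department text,
    firstname text,
    lastname text,
    PRIMARY KEY (emailaddress, department)
);

-- Select using = on the partition key: reads a single partition.
SELECT * FROM demo.users_by_email
  WHERE emailaddress = 'brian@example.com';

-- Select using IN on the partition key: reads several partitions.
SELECT * FROM demo.users_by_email
  WHERE emailaddress IN ('brian@example.com', 'tom@example.com');

-- Counts: cheap within one partition; an unrestricted COUNT(*) has to
-- visit every partition in the table.
SELECT COUNT(*) FROM demo.users_by_email
  WHERE emailaddress = 'brian@example.com';

-- Restricting only the cluster key forces a scan across partitions, so
-- Cassandra asks for an explicit opt-in with ALLOW FILTERING.
SELECT * FROM demo.users_by_email
  WHERE department = 'engineering'
  ALLOW FILTERING;
```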

Page 26: Cassandra20141113

Batch Operations

Saves Network Roundtrips

Can contain INSERT, UPDATE, DELETE

Atomic by default (all or nothing)

Can use timestamp for specific ordering

Page 27: Cassandra20141113

Batch Operation Example

    BEGIN BATCH
        INSERT INTO users (emailaddress, firstname, lastname, country) values
            ('[email protected]', 'brian', 'enochson', 'USA');
        INSERT INTO users (emailaddress, firstname, lastname, country) values
            ('[email protected]', 'tom', 'peters', 'DE');
        INSERT INTO users (emailaddress, firstname, lastname, country) values
            ('[email protected]', 'jim', 'smith', 'USA');
        INSERT INTO users (emailaddress, firstname, lastname, country) values
            ('[email protected]', 'alan', 'rogers', 'USA');
        DELETE FROM users WHERE emailaddress = '[email protected]';
    APPLY BATCH;

select in cqlsh

List in cassandra-cli with timestamp
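
A sketch of how explicit timestamps fix the ordering inside a batch. The table and values are hypothetical. The timestamps are microseconds since the epoch, picked by hand here for illustration.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE IF NOT EXISTS demo.batch_users (
    emailaddress text PRIMARY KEY,
    firstname text
);

-- Both statements write the same cell. The write with the higher timestamp
-- wins, regardless of the order the statements appear in the batch.
BEGIN BATCH
    INSERT INTO demo.batch_users (emailaddress, firstname)
      VALUES ('brian@example.com', 'newer') USING TIMESTAMP 1415836800000002;
    INSERT INTO demo.batch_users (emailaddress, firstname)
      VALUES ('brian@example.com', 'older') USING TIMESTAMP 1415836800000001;
APPLY BATCH;

-- firstname reads back as 'newer'; WRITETIME shows the winning timestamp.
SELECT firstname, WRITETIME(firstname) FROM demo.batch_users
  WHERE emailaddress = 'brian@example.com';
```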

Page 28: Cassandra20141113

More Data Modeling…

No Joins

No Foreign Keys

No Third (or any other) Normal Form Concerns

Redundant Data Encouraged. Apps maintain consistency.

Page 29: Cassandra20141113

Secondary Indexes

Allow you to define indexes for access by columns other than the partition key.

Each node has a local index for its data.

They have uses, but shouldn’t be used all the time without consideration.

We will look at alternatives.

Page 30: Cassandra20141113

Secondary Index Example

Create a table

Try to select with column not in PK

Add Secondary Index

Try select again. (maybe need to reinsert)
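
The steps above could be sketched as follows, with a hypothetical table, index name and values.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- 1. Create a table and add a row.
CREATE TABLE IF NOT EXISTS demo.users_indexed (
    emailaddress text PRIMARY KEY,
    firstname text,
    lastname text,
    country text
);

INSERT INTO demo.users_indexed (emailaddress, firstname, lastname, country)
  VALUES ('brian@example.com', 'brian', 'enochson', 'USA');

-- 2. Selecting by a column that is not in the primary key is rejected
--    as written, because country is not indexed yet.
-- SELECT * FROM demo.users_indexed WHERE country = 'USA';

-- 3. Add a secondary index on that column.
CREATE INDEX users_indexed_country_idx ON demo.users_indexed (country);

-- 4. The same select is now accepted. Existing rows are indexed in the
--    background, so results may be incomplete for a moment right after
--    the index is created.
SELECT * FROM demo.users_indexed WHERE country = 'USA';
```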

Page 31: Cassandra20141113

When to use?

Low Cardinality – small number of unique values

High Cardinality – high number of distinct values

Secondary Indexes are good for Low Cardinality, such as country codes or department codes, not email addresses.

Page 32: Cassandra20141113

Materialized View

If you want full distribution, you can use what is called the Materialized View pattern.

Remember redundant data is fine.

Model the queries

Page 33: Cassandra20141113

Materialized View Example

Show normal table with compound key and querying limitations

Create Materialized View table with a different compound key to support alternate access.

Selects use partition key.

Secondary indexes are local to each node, not distributed

ALLOW FILTERING can cause performance issues
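
A sketch of the pattern: the same data is written to two tables, each keyed for one query, and the application keeps them in step. The table names and values are hypothetical, and the logged batch is one possible way to write both copies together.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Base table: look users up by email address.
CREATE TABLE IF NOT EXISTS demo.users_mv_by_email (
    emailaddress text PRIMARY KEY,
    firstname text,
    lastname text,
    country text
);

-- "Materialized view" table: the same data keyed by country, so that
-- query is served by a partition key lookup as well.
CREATE TABLE IF NOT EXISTS demo.users_mv_by_country (
    country text,
    emailaddress text,
    firstname text,
    lastname text,
    PRIMARY KEY (country, emailaddress)
);

-- The application writes both copies; a batch keeps them together.
BEGIN BATCH
    INSERT INTO demo.users_mv_by_email (emailaddress, firstname, lastname, country)
      VALUES ('brian@example.com', 'brian', 'enochson', 'USA');
    INSERT INTO demo.users_mv_by_country (country, emailaddress, firstname, lastname)
      VALUES ('USA', 'brian@example.com', 'brian', 'enochson');
APPLY BATCH;

-- Each query hits the table whose partition key matches it.
SELECT * FROM demo.users_mv_by_email WHERE emailaddress = 'brian@example.com';
SELECT * FROM demo.users_mv_by_country WHERE country = 'USA';
```

Compared with a secondary index on country, the second table spreads the data across the cluster by country and answers from a single partition, at the cost of storing the data twice.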

Page 34: Cassandra20141113

Counters

Updated in 2.1 and now work in a more distributed and accurate manner.

Table organization, example

How to update, view etc.
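
A minimal counter sketch with a hypothetical table. In a counter table every column outside the primary key has to be a counter, and values are changed only through UPDATE with an increment or decrement, never INSERT.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Table organization: a key plus one or more counter columns.
CREATE TABLE IF NOT EXISTS demo.page_views (
    page text PRIMARY KEY,
    views counter
);

-- Update: the row is created on first increment.
UPDATE demo.page_views SET views = views + 1 WHERE page = '/home';
UPDATE demo.page_views SET views = views + 5 WHERE page = '/home';
UPDATE demo.page_views SET views = views - 1 WHERE page = '/home';

-- View the current value.
SELECT page, views FROM demo.page_views WHERE page = '/home';
```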

Page 35: Cassandra20141113

Time Series Example…

Time series table model.

Need to consider interval for event frequency and wide row size.

Make what is tracked, together with the unit of interval (e.g. the day), the partition key.

Page 36: Cassandra20141113

Time Series Data

Because of its fast write path, Cassandra is well suited to storing time series data.

The Cassandra wide row is a perfect fit for modeling time series / time based events.

Let’s look at an example….

Page 37: Cassandra20141113

Event Data

Notice primary key and cluster key.

Insert some data

View in CQL, then in CLI as wide row
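
One way such an event table might look. The sensor naming, the one-day bucket and the values are assumptions made for illustration. The bucket size should be chosen from the event frequency so that a single wide row stays a manageable size.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Partition key: what is tracked (sensor_id) plus the interval (day).
-- Cluster key: event_time, so events in a partition are stored in time order.
CREATE TABLE IF NOT EXISTS demo.sensor_events (
    sensor_id text,
    day text,
    event_time timestamp,
    reading double,
    PRIMARY KEY ((sensor_id, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

INSERT INTO demo.sensor_events (sensor_id, day, event_time, reading)
  VALUES ('sensor-1', '2014-11-13', '2014-11-13 09:00:00+0000', 20.5);
INSERT INTO demo.sensor_events (sensor_id, day, event_time, reading)
  VALUES ('sensor-1', '2014-11-13', '2014-11-13 09:05:00+0000', 20.9);

-- Range query over one hour inside a single partition.
SELECT event_time, reading FROM demo.sensor_events
  WHERE sensor_id = 'sensor-1' AND day = '2014-11-13'
    AND event_time >= '2014-11-13 09:00:00+0000'
    AND event_time < '2014-11-13 10:00:00+0000';

-- Latest reading of the day: newest comes first thanks to the DESC order.
SELECT event_time, reading FROM demo.sensor_events
  WHERE sensor_id = 'sensor-1' AND day = '2014-11-13'
  LIMIT 1;
```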

Page 38: Cassandra20141113

TTL – Self Expiring Data

Another technique is storing data that has a defined lifespan.

For instance session identifiers, temporary passwords etc.

For this Cassandra provides a Time To Live (TTL) mechanism.

Page 39: Cassandra20141113

TTL Example…

Create table

Insert data using TTL

Can update a specific column with its own TTL

Show using selects.
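
The steps above could be sketched like this, with a hypothetical table and values. TTLs are given in seconds.

```cql
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Create table.
CREATE TABLE IF NOT EXISTS demo.temp_passwords (
    username text PRIMARY KEY,
    temp_password text,
    hint text
);

-- Insert data using TTL: every column written here expires after one hour.
INSERT INTO demo.temp_passwords (username, temp_password, hint)
  VALUES ('brian', 's3cret', 'pet name')
  USING TTL 3600;

-- Show the seconds remaining for a column.
SELECT TTL(temp_password) FROM demo.temp_passwords WHERE username = 'brian';

-- Update one column with its own, shorter TTL; hint keeps its original one.
UPDATE demo.temp_passwords USING TTL 60
  SET temp_password = 'n3w'
  WHERE username = 'brian';

SELECT TTL(temp_password), TTL(hint) FROM demo.temp_passwords
  WHERE username = 'brian';
```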

Page 40: Cassandra20141113

Questions

http://www.thejavaexperts.net/

Email: [email protected]

Twitter: @benochso

G+: https://plus.google.com/+BrianEnochson