cassandra20141009
Agenda
Quick Review Of Cassandra
New Developments In Cassandra
Basic Data Modeling Concepts
Materialized Views
Secondary Indexes
Counters
Time Series Data
Expiring Data
2
Cassandra High Level
Cassandra's architecture is based on the
combination of two technologies
Google BigTable – Data Model
Amazon Dynamo – Distributed
Architecture
Cassandra = C*
3
Architecture Basics &
Terminology
Nodes are single instances of C*
Cluster is a group of nodes
Data is organized by keys (tokens) which
are distributed across the cluster
Replication Factor (rf) determines how
many copies of each key are kept
Data Center Aware
Consistency Level – powerful feature to
tune consistency vs speed vs availability.
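A minimal sketch of how keys (tokens) and the replication factor fit together, assuming a simplified ring where each node owns exactly one token and replicas are the next rf distinct nodes clockwise. The MD5 hash, the node names and the helper names `token_for` and `replicas_for` are illustrative assumptions, not Cassandra's own code.

```python
import hashlib
from bisect import bisect_left


def token_for(key: str) -> int:
    """Hash a key to a token (MD5 is used here purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replicas_for(key: str, ring: dict, rf: int) -> list:
    """Walk the ring clockwise from the key's token and collect rf distinct nodes."""
    tokens = sorted(ring)
    # First node token >= key token; wrap to the start of the ring if none.
    start = bisect_left(tokens, token_for(key)) % len(tokens)
    owners = []
    for i in range(len(tokens)):
        node = ring[tokens[(start + i) % len(tokens)]]
        if node not in owners:
            owners.append(node)
        if len(owners) == rf:
            break
    return owners


# Hypothetical four-node cluster, one token per node.
ring = {token_for(name): name for name in ("node1", "node2", "node3", "node4")}
owners = replicas_for("user-42", ring, rf=3)
print(owners)  # three distinct node names; which ones depends on the hash
```

With rf=3 on four nodes, every key lives on three of them, so one node can be down and the data is still reachable.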
4
More Architecture
Information on who has what data and
who is available is transferred using
gossip.
No single point of failure (SPOF); every
node can service requests.
Data Center Aware
6
CAP Theorem
Distributed Systems Law:
Consistency
Availability
Partition Tolerance
(you can only really have two in a distributed system)
Cassandra is AP with Eventual
Consistency
7
Consistency
Cassandra uses the concept of Tunable
Consistency, which makes it very
powerful and flexible for system needs.
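A small sketch of the arithmetic behind tunable consistency: a quorum is a majority of the replicas, and a read is guaranteed to overlap the latest write when the number of replicas read plus the number written exceeds the replication factor. The function names are made up for illustration.

```python
def quorum(rf: int) -> int:
    """Number of replicas that make up a majority for a given replication factor."""
    return rf // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf


rf = 3
q = quorum(rf)
print(q)                                  # 2
print(is_strongly_consistent(q, q, rf))   # True: QUORUM reads + QUORUM writes
print(is_strongly_consistent(1, 1, rf))   # False: ONE + ONE is eventually consistent
```

Lower levels such as ONE trade that overlap guarantee for speed and availability.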
8
Data Model Architecture
Keyspace – container of column families
(tables). Defines the replication factor (RF), among other settings.
Table – column family. Contains
definition of schema.
Row – a “record” identified by a key
Column - a key and a value
12
Keys
Primary Key
Partition Key – identifies a row
Cluster Key (clustering column in CQL terms) – sorting within a row
Using CQL these are defined together as a compound (composite) key
Compound keys are how you implement “wide rows” which we will look at a lot!
14
Single Primary Key
create table users (
user_id UUID PRIMARY KEY,
firstname text,
lastname text,
emailaddress text
);
** Cassandra Data Types: http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/cql_data_types_c.html
15
Compound Key
create table users (
emailaddress text,
department text,
firstname text,
lastname text,
PRIMARY KEY (emailaddress, department)
);
Partition Key plus Cluster Key
emailaddress is partition key
department is cluster key
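A rough in-memory model of what the compound key above implies, assuming a partition is a map of cluster key to columns that is read back in cluster key order. The class `WideRowTable` and the sample address are hypothetical, and this ignores everything about real storage.

```python
class WideRowTable:
    """Toy model: partition key -> {cluster key -> columns}, read in cluster key order."""

    def __init__(self):
        self._partitions = {}

    def insert(self, partition_key, cluster_key, **columns):
        # Same partition key + cluster key overwrites (an upsert).
        self._partitions.setdefault(partition_key, {})[cluster_key] = columns

    def select(self, partition_key):
        rows = self._partitions.get(partition_key, {})
        return [(ck, rows[ck]) for ck in sorted(rows)]


users = WideRowTable()
users.insert("alice@example.com", "sales", firstname="alice", lastname="doe")
users.insert("alice@example.com", "engineering", firstname="alice", lastname="doe")

for department, columns in users.select("alice@example.com"):
    print(department, columns)  # engineering first, then sales
```

All rows for one emailaddress sit together in one partition, which is what makes it a "wide row".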
16
Compound Key
create table users (
emailaddress text,
department text,
country text,
firstname text,
lastname text,
PRIMARY KEY ((emailaddress, department), country)
);
Partition Key plus Cluster Key
emailaddress & department together form the partition key
country is cluster key
17
Deletions
Distributed systems present a unique problem for deletes. If the data were actually removed while a node was down and missed the delete notice, that node would try to re-create the record when it came back online. So…
Tombstone - the data is replaced with a special marker called a tombstone, which lets the delete work within a distributed architecture
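The delete problem can be sketched with two toy replicas, assuming each cell carries a timestamp and the newest timestamp wins when replicas exchange data. `Replica`, `repair` and `TOMBSTONE` are illustrative names, not Cassandra internals.

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"


class Replica:
    def __init__(self):
        self.cells = {}  # key -> (timestamp, value)

    def apply(self, key, timestamp, value):
        """Keep the incoming cell only if it is newer than what we hold."""
        current = self.cells.get(key)
        if current is None or timestamp > current[0]:
            self.cells[key] = (timestamp, value)

    def read(self, key):
        cell = self.cells.get(key)
        if cell is None or cell[1] is TOMBSTONE:
            return None
        return cell[1]


def repair(a, b):
    """Exchange cells so both replicas end up with the newest version of each key."""
    for key in set(a.cells) | set(b.cells):
        for src, dst in ((a, b), (b, a)):
            if key in src.cells:
                dst.apply(key, *src.cells[key])


# Naive hard delete: replica d was down and still holds the row.
c, d = Replica(), Replica()
for r in (c, d):
    r.apply("user1", 1, "tom")
del c.cells["user1"]
repair(c, d)
print(c.read("user1"))  # tom (the deleted row came back)

# Tombstone delete: the marker is newer than the row, so it wins on repair.
a, b = Replica(), Replica()
for r in (a, b):
    r.apply("user1", 1, "tom")
a.apply("user1", 2, TOMBSTONE)
repair(a, b)
print(b.read("user1"))  # None
```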
18
New Rules
Writes Are Cheap
Denormalize All You Need
Model Your Queries, Not Data
(understand access patterns)
Application Worries About Joins
19
What’s New In 2.0
Conditional DDL
IF EXISTS or IF NOT EXISTS
Drop Column Support
ALTER TABLE users DROP lastname;
20
More New Stuff
Triggers
CREATE TRIGGER myTrigger
ON myTable
USING 'com.thejavaexperts.cassandra.updateevt';
Lightweight Transactions (CAS)
UPDATE users
SET firstname = 'tim'
WHERE emailaddress = '[email protected]'
IF firstname = 'tom';
** Not like an ACID Transaction!!
21
CAS & Transactions
CAS - compare-and-set operations. A
single, atomic operation compares the
value of a column in the database and
applies a modification depending on
the result of the comparison.
Consider performance hit. CAS is (was)
considered an anti-pattern.
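A single-process sketch of the compare-and-set idea, assuming a lock stands in for the cluster-wide coordination Cassandra has to perform (which is where the performance hit comes from). The `Row` class and `update_if` method are hypothetical names.

```python
import threading


class Row:
    def __init__(self, **columns):
        self._columns = dict(columns)
        self._lock = threading.Lock()

    def update_if(self, column, expected, new_value) -> bool:
        """Set column to new_value only if it currently equals expected."""
        with self._lock:  # compare and set happen as one atomic step
            if self._columns.get(column) != expected:
                return False
            self._columns[column] = new_value
            return True

    def get(self, column):
        with self._lock:
            return self._columns.get(column)


user = Row(firstname="tom")
first = user.update_if("firstname", "tom", "tim")   # condition holds, applied
second = user.update_if("firstname", "tom", "ted")  # condition no longer holds
print(first, second, user.get("firstname"))  # True False tim
```

The second update is rejected rather than silently overwriting, which is the whole point of the IF clause.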
22
Data Modeling… The Basics
Cassandra is now very familiar to
RDBMS/SQL users.
Very nicely hides the underlying data
storage model.
Still have all the power of Cassandra, it
is all in the key definition.
RDBMS = model data
Cassandra = model access (queries)
23
Side-Note On Querying
Create table with compound key
Select using ALLOW FILTERING
Counts
Select using IN or =
24
Batch Operations
Saves Network Roundtrips
Can contain INSERT, UPDATE,
DELETE
Atomic by default (all or nothing)
Can use timestamp for specific ordering
25
Batch Operation Example
BEGIN BATCH
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('[email protected]', 'brian', 'enochson', 'USA');
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('[email protected]', 'tom', 'peters', 'DE');
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('[email protected]', 'jim', 'smith', 'USA');
INSERT INTO users (emailaddress, firstname, lastname, country)
values ('[email protected]', 'alan', 'rogers', 'USA');
DELETE FROM users WHERE emailaddress = '[email protected]';
APPLY BATCH;
select in cqlsh
List in cassandra-cli with timestamp
26
More Data Modeling…
No Joins
No Foreign Keys
No Third (or any other) Normal Form Concerns
Redundant Data Encouraged. Apps maintain consistency.
27
Secondary Indexes
Let you define indexes to query by
columns other than the partition key.
Each node has a local index for its data.
They have uses, but shouldn’t be used
all the time without consideration.
We will look at alternatives.
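A toy illustration of why a node-local index needs care, assuming rows are spread over nodes by a hash of the partition key and each node indexes only its own rows. `Node`, `node_for`, `query_by_index` and the sample addresses are made up for this sketch.

```python
import zlib
from collections import defaultdict


class Node:
    def __init__(self):
        self.rows = {}                 # partition key -> row
        self.index = defaultdict(set)  # indexed value -> partition keys held locally

    def insert(self, key, row, indexed_column):
        self.rows[key] = row
        self.index[row[indexed_column]].add(key)

    def lookup(self, value):
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def node_for(key, nodes):
    """Pick the node that owns a partition key (stable hash, illustration only)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def query_by_index(nodes, value):
    """Ask every node, because no single node knows where matching rows live."""
    results, contacted = [], 0
    for node in nodes:
        contacted += 1
        results.extend(node.lookup(value))
    return results, contacted


nodes = [Node() for _ in range(3)]
sample = [("a@example.com", "USA"), ("b@example.com", "DE"), ("c@example.com", "USA")]
for email, country in sample:
    row = {"emailaddress": email, "country": country}
    node_for(email, nodes).insert(email, row, "country")

matches, contacted = query_by_index(nodes, "USA")
print(len(matches), contacted)  # 2 3
```

A lookup by partition key touches only the owning node; a lookup through the index touches all of them.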
28
Secondary Index Example
Create a table
Try to select with column not in PK
Add Secondary Index
Try select again.
29
When to use?
Low Cardinality – small number of unique
values
High Cardinality – high number of distinct
values
Secondary Indexes are good for Low
Cardinality. So country codes, department
codes etc. Not email addresses.
30
Materialized View
If you want full distribution, you can use what is
called the Materialized View pattern.
Remember redundant data is fine.
Model the queries
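A sketch of the pattern, assuming two plain dictionaries stand in for two tables that hold the same data under different keys, with the application writing to both. The table and function names are hypothetical.

```python
users_by_email = {}    # partition key: emailaddress
users_by_country = {}  # partition key: country, cluster key: emailaddress


def add_user(emailaddress, firstname, lastname, country):
    """The application writes the same row to every table that serves a query."""
    row = {
        "emailaddress": emailaddress,
        "firstname": firstname,
        "lastname": lastname,
        "country": country,
    }
    users_by_email[emailaddress] = row
    users_by_country.setdefault(country, {})[emailaddress] = row


def find_by_country(country):
    rows = users_by_country.get(country, {})
    return [rows[email] for email in sorted(rows)]


add_user("a@example.com", "alan", "rogers", "USA")
add_user("b@example.com", "tom", "peters", "DE")
add_user("c@example.com", "jim", "smith", "USA")

print(users_by_email["b@example.com"]["firstname"])       # tom
print([u["firstname"] for u in find_by_country("USA")])   # ['alan', 'jim']
```

Each query hits exactly one partition of one table; the cost is the extra write and keeping the copies in step.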
31
Materialized View Example
Show normal table with compound key and querying limitations
Create Materialized View table with a different compound key to support alternate access.
Selects use the partition key. Secondary indexes are local, not distributed
ALLOW FILTERING can cause performance issues
32
Counters
Updated in 2.1 and now work in a more
distributed and accurate manner.
Table organization, example
How to update, view etc.
33
Time Series Example….
Time series table model.
Need to consider interval for event
frequency and wide row size.
Make the partition key a combination of what
is tracked and the unit of interval.
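A minimal sketch of that bucketing, assuming a sensor is the thing being tracked and one day is the interval, so each (sensor, day) pair becomes its own wide row ordered by event time. The names `partition_key`, `record` and `readings_for_day` are illustrative.

```python
from datetime import datetime, timezone

table = {}  # (sensor_id, day) -> {event_time -> reading}


def partition_key(sensor_id: str, event_time: datetime) -> tuple:
    """Tracked item plus the day bucket; a busier source might bucket by hour instead."""
    return (sensor_id, event_time.strftime("%Y-%m-%d"))


def record(sensor_id, event_time, reading):
    table.setdefault(partition_key(sensor_id, event_time), {})[event_time] = reading


def readings_for_day(sensor_id, day):
    """Return one bucket's events in time order, like reading a single wide row."""
    rows = table.get((sensor_id, day), {})
    return [(ts, rows[ts]) for ts in sorted(rows)]


utc = timezone.utc
record("sensor-1", datetime(2014, 10, 9, 8, 30, tzinfo=utc), 21.5)
record("sensor-1", datetime(2014, 10, 9, 8, 0, tzinfo=utc), 20.9)
record("sensor-1", datetime(2014, 10, 10, 8, 0, tzinfo=utc), 19.4)

print([r for _, r in readings_for_day("sensor-1", "2014-10-09")])  # [20.9, 21.5]
print(len(table))  # 2
```

Picking the interval is the sizing decision: too coarse and a row grows without bound, too fine and a range query has to visit many partitions.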
34
Time Series Data
Due to its quick writing model
Cassandra is suited for storing time
series data.
The Cassandra wide row is a perfect fit
for modeling time series / time based
events.
Let’s look at an example….
35
Event Data
Notice primary key and cluster key.
Insert some data
View in CQL, then in CLI as wide row
36
TTL – Self Expiring Data
Another technique is data that has a
defined lifespan.
For instance session identifiers,
temporary passwords etc.
For this Cassandra provides a Time To
Live (TTL) mechanism.
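The idea can be sketched as a store where each value optionally carries an expiry time and reads ignore anything past it. `ExpiringStore` and its injectable clock are assumptions made so the example runs without waiting; this is not how Cassandra implements TTL internally.

```python
import time


class ExpiringStore:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._cells = {}  # key -> (value, expires_at or None)

    def insert(self, key, value, ttl=None):
        """Store a value; with ttl (seconds) it disappears once that time has passed."""
        expires_at = None if ttl is None else self._clock() + ttl
        self._cells[key] = (value, expires_at)

    def get(self, key):
        cell = self._cells.get(key)
        if cell is None:
            return None
        value, expires_at = cell
        if expires_at is not None and self._clock() >= expires_at:
            del self._cells[key]  # expired: drop it on read
            return None
        return value


now = [0.0]  # fake clock so the example is deterministic
sessions = ExpiringStore(clock=lambda: now[0])
sessions.insert("session-abc", "user1", ttl=30)
sessions.insert("profile-1", "permanent")

now[0] = 29.0
print(sessions.get("session-abc"))  # user1
now[0] = 30.0
print(sessions.get("session-abc"))  # None
print(sessions.get("profile-1"))    # permanent
```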
37
TTL Example…
Create table
Insert data using TTL
Can update specific column with TTL
Show using selects.
38