Cassandra 3.0 Awesomeness

©2013 DataStax Confidential. Do not distribute without consent.

@rustyrazorblade

Jon Haddad, Technical Evangelist, DataStax



2.2 Stuff First

User Defined Functions
• Apply functions to data in a table
• Defined as Java or JavaScript
• Bring your own Python/Ruby

CREATE OR REPLACE FUNCTION fLog (input double)
    CALLED ON NULL INPUT
    RETURNS double
    LANGUAGE java
    AS 'return Double.valueOf(Math.log(input.doubleValue()));';

Functions in Action

cqlsh:test> create table blah2 ( id double primary key, name text );

cqlsh:test> insert into blah2 (id, name) values (1.0334343, 'jon');

cqlsh:test> select flog(id) from blah2;

 test.flog(id)
---------------
      0.032888

(1 rows)
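As a quick sanity check on the cqlsh output above, the same natural-log computation can be reproduced outside Cassandra. This is only a Python sketch, not the UDF itself; the helper name `flog` simply mirrors the CQL function.

```python
import math


def flog(value: float) -> float:
    """Mirror of the fLog UDF body: natural logarithm of the input."""
    return math.log(value)


# Same id value that was inserted into blah2 above.
result = flog(1.0334343)
print(f"{result:.6f}")
```

The printed value should agree with the cqlsh result to the displayed precision.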

Aggregates
• Several built-ins: min(), max(), avg(), count(), sum()
• Can provide user defined aggregates
• Defined as Java or JavaScript
• Avoid aggregating across partitions
• User defined functions and aggregates must be enabled in cassandra.yaml (enable_user_defined_functions: true)
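A user defined aggregate is assembled from a state function that is applied once per row and an optional final function that converts the accumulated state into the result. The Python sketch below models that shape for an average over the rows of one partition. The names `avg_state`, `avg_final`, and `run_aggregate` and the sample values are hypothetical, chosen for illustration only.

```python
from typing import Iterable, Tuple

State = Tuple[int, float]  # (row count, running sum)


def avg_state(state: State, value: float) -> State:
    """State function: folds one row's value into the running state."""
    count, total = state
    return (count + 1, total + value)


def avg_final(state: State) -> float:
    """Final function: turns the accumulated state into the result."""
    count, total = state
    return total / count if count else 0.0


def run_aggregate(values: Iterable[float], initcond: State = (0, 0.0)) -> float:
    """Drive the state function over every row, then apply the final function."""
    state = initcond
    for value in values:
        state = avg_state(state, value)
    return avg_final(state)


# Hypothetical values read from a single partition.
partition_values = [1.0, 2.0, 3.0, 6.0]
average = run_aggregate(partition_values)
print(average)
```

Keeping the input to a single partition matches the advice above: the state function runs on every row it is given, so a cross-partition aggregate has to touch every partition.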

Native JSON Support

INSERT INTO mytable JSON '{"myKey": 0, "value": 0}';

SELECT JSON name, occupation FROM users WHERE userid = 199;
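On the client side, the JSON payload for such an insert is usually produced by a serializer rather than typed by hand. The following Python sketch builds the statement text shown above. The helper `insert_json_statement` is hypothetical, and a real application would more likely pass the JSON string as a bound parameter through its driver instead of formatting it into the statement.

```python
import json


def insert_json_statement(table: str, row: dict) -> str:
    """Build an INSERT ... JSON statement for the given row (illustration only)."""
    payload = json.dumps(row)
    # CQL string literals escape a single quote by doubling it.
    escaped = payload.replace("'", "''")
    return f"INSERT INTO {table} JSON '{escaped}';"


stmt = insert_json_statement("mytable", {"myKey": 0, "value": 0})
print(stmt)

# SELECT JSON returns each row as one JSON string; this sample string is made up.
sample_row = '{"name": "jon", "occupation": "evangelist"}'
decoded = json.loads(sample_row)
print(decoded["name"])
```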

3.0 Stuff

G1GC
• Improvement over ParNew+CMS, which is hard to tune
• CASSANDRA-8150
• G1 has more predictable pauses
• Better latency
• Many new-gen and many old-gen regions
• G1 is adaptive to usage

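[Figure: heap layouts. The traditional heap has contiguous Eden, S0, S1, and Old Gen spaces; the G1 heap is a grid of regions marked E, S, and O.]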

Improved vnode allocation
• Previous method allocated tokens randomly
• vnode problems: increased sockets, repairs take longer
• New clusters can allocate fewer vnodes (4-12)
• CASSANDRA-7032
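To see why purely random token allocation tends to need many vnodes for an even spread, the Python sketch below places random tokens on a ring and measures each node's share of it. It only models the older random approach. It does not implement the allocation algorithm from CASSANDRA-7032, and the ring size, node count, and function name are assumptions made for illustration.

```python
import random

RING_SIZE = 2**64  # assumed token space for this sketch


def random_ownership(nodes: int, vnodes_per_node: int, seed: int = 0) -> list:
    """Return each node's fraction of the ring under random token allocation."""
    rng = random.Random(seed)
    tokens = sorted(
        (rng.randrange(RING_SIZE), node)
        for node in range(nodes)
        for _ in range(vnodes_per_node)
    )
    owned = [0] * nodes
    for i, (token, node) in enumerate(tokens):
        # A token owns the range from the previous token up to itself;
        # index -1 wraps around the ring for the first token.
        previous_token = tokens[i - 1][0]
        owned[node] += (token - previous_token) % RING_SIZE
    return [share / RING_SIZE for share in owned]


for vnodes in (4, 256):
    shares = random_ownership(nodes=6, vnodes_per_node=vnodes)
    print(vnodes, "vnodes per node: max/min ownership =", max(shares) / min(shares))
```

Comparing the printed ratios for few and many tokens gives a feel for how uneven random placement can be, which is the problem a smarter allocator has to solve before low token counts become practical.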

Pre 3.0 Hints
• Cassandra is a pretty bad queue
• Pre 3.0 hints are a queue
• Generates lots of tombstones
• Can result in instability

CREATE TABLE system.hints (
    target_id uuid,
    hint_id timeuuid,
    message_version int,
    mutation blob,
    PRIMARY KEY (target_id, hint_id, message_version)
) WITH COMPACT STORAGE;


3.0 Hints
• CASSANDRA-9427
• Write hints to a file instead
• Removes overhead of compaction
• No longer using C* as a queue
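The idea of writing hints to a file can be sketched as an append-only log per target node that is replayed in order and then deleted as a whole, so nothing is left behind to tombstone or compact. The Python below is a toy model under that assumption. The class name, file naming, and length-prefixed record format are invented for illustration and are not Cassandra's actual on-disk hint format.

```python
import os
import tempfile


class HintFile:
    """Toy append-only hint log for one target node (hypothetical format)."""

    def __init__(self, directory: str, target_id: str) -> None:
        self.path = os.path.join(directory, f"{target_id}.hints")

    def append(self, mutation: bytes) -> None:
        # Each record is a 4-byte big-endian length followed by the payload.
        with open(self.path, "ab") as handle:
            handle.write(len(mutation).to_bytes(4, "big"))
            handle.write(mutation)

    def replay(self) -> list:
        """Read mutations back in write order, then delete the whole file."""
        mutations = []
        if not os.path.exists(self.path):
            return mutations
        with open(self.path, "rb") as handle:
            while True:
                header = handle.read(4)
                if len(header) < 4:
                    break
                size = int.from_bytes(header, "big")
                mutations.append(handle.read(size))
        os.remove(self.path)
        return mutations


with tempfile.TemporaryDirectory() as tmp:
    hints = HintFile(tmp, "node-b")
    hints.append(b"mutation-1")
    hints.append(b"mutation-2")
    replayed = hints.replay()
    file_gone = not os.path.exists(hints.path)

print(replayed, file_gone)
```

Delivery here removes the file in one step, in contrast to the table-backed queue above, where every delivered hint becomes a delete that compaction has to clean up later.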

Materialized Views

Cassandra Data Modeling

 sensor_id | timestamp | value
-----------+-----------+-------
         1 |         1 |     1
         1 |         2 |     2
         2 |         1 |     2
         2 |         2 |     1

create table sensor_data (
    sensor_id int,
    timestamp int,
    value int,
    primary key (sensor_id, timestamp)
);

Cool… but…
• What if we want to query sensor data by timestamp?
• We can't efficiently query on timestamp
• Need to maintain 2 tables
• In 3.0, use materialized views

CREATE MATERIALIZED VIEW sensor_by_value AS
    SELECT value, timestamp, sensor_id
    FROM sensor_data
    WHERE value IS NOT NULL
      AND timestamp IS NOT NULL
      AND sensor_id IS NOT NULL
    PRIMARY KEY (timestamp, sensor_id, value);

Materialized View
• Table managed for you
• Updated async behind the scenes
• Built automatically when created
• Can't be mixed with functions yet
• CASSANDRA-9664

cqlsh:test> select * from sensor_by_value;

 timestamp | sensor_id | value
-----------+-----------+-------
         1 |         1 |     1
         1 |         2 |     2
         2 |         1 |     2
         2 |         2 |     1

(4 rows)

cqlsh:test> select * from sensor_by_value where timestamp = 1;

 timestamp | sensor_id | value
-----------+-----------+-------
         1 |         1 |     1
         1 |         2 |     2

(2 rows)
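Conceptually, the view is a second copy of the data keyed by timestamp that is kept in step with every write to the base table. The Python sketch below models that bookkeeping with two in-memory structures. It is a simplification: the class and method names are hypothetical, and the update here is synchronous, whereas the slides note that Cassandra updates the view asynchronously.

```python
from collections import defaultdict


class SensorData:
    """Toy model of a base table plus a view keyed by timestamp."""

    def __init__(self) -> None:
        self.base = {}                        # (sensor_id, timestamp) -> value
        self.by_timestamp = defaultdict(set)  # timestamp -> {(sensor_id, value)}

    def insert(self, sensor_id: int, timestamp: int, value: int) -> None:
        key = (sensor_id, timestamp)
        old_value = self.base.get(key)
        if old_value is not None:
            # The view row is keyed by value too, so remove the stale entry.
            self.by_timestamp[timestamp].discard((sensor_id, old_value))
        self.base[key] = value
        self.by_timestamp[timestamp].add((sensor_id, value))

    def select_by_timestamp(self, timestamp: int) -> list:
        rows = self.by_timestamp.get(timestamp, set())
        return [(timestamp, sensor_id, value) for sensor_id, value in sorted(rows)]


table = SensorData()
for sensor_id, timestamp, value in [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1)]:
    table.insert(sensor_id, timestamp, value)

rows_at_1 = table.select_by_timestamp(1)
print(rows_at_1)
```

The stale-entry removal in `insert` is the part an application would otherwise have to get right by hand when maintaining two tables.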

New Storage Engine

Pre 3.0
• Clustering keys are repeated for each cell
• Timestamps are repeated in each cell
• TTLs are... you get the idea
• Rows are a bolted-on construct, only known by a convention
• Lots of wasted space
• Lots of repetition

Storage in 3.0
• Rows are a first-class entity
• Timestamps and TTLs can be stored on the row
• Clustering keys are not repeated
• Conversion to iterators for memory efficiency
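The difference in repetition can be illustrated by counting how many copies of the clustering key each layout stores. The Python sketch below is a deliberately simplified model, not the real SSTable format. The function names and sample rows are hypothetical, and it assumes every cell in a row shares the row's timestamp.

```python
def cells_pre_30(rows: list) -> list:
    """Cell-oriented layout: clustering key and timestamp repeated in every cell."""
    return [
        (row["clustering"], column, value, row["timestamp"])
        for row in rows
        for column, value in row["columns"].items()
    ]


def rows_30(rows: list) -> list:
    """Row-oriented layout: clustering key and timestamp stored once per row."""
    return [
        (row["clustering"], row["timestamp"], tuple(row["columns"].items()))
        for row in rows
    ]


# Hypothetical partition: two rows with three columns each.
sample = [
    {"clustering": 1, "timestamp": 1000, "columns": {"a": 1, "b": 2, "c": 3}},
    {"clustering": 2, "timestamp": 1001, "columns": {"a": 4, "b": 5, "c": 6}},
]

old_key_copies = len(cells_pre_30(sample))  # one clustering key per cell
new_key_copies = len(rows_30(sample))       # one clustering key per row
print(old_key_copies, new_key_copies)
```

In this model the savings grow with the number of columns per row, since the per-row metadata is written once no matter how many cells the row holds.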
