NYC* Jonathan Ellis keynote: "Cassandra 1.2 + 2.0"

Post on 10-Jul-2015


DESCRIPTION

Jonathan Ellis, Apache Cassandra Project Chair & DataStax Co-Founder, presents Apache Cassandra 1.2 + 2.0.

TRANSCRIPT

Cassandra 1.2 (and 2.0)

Jonathan Ellis
Project Chair, Apache Cassandra
CTO, DataStax
@spyced

©2012 DataStax


• Massively scalable

• High performance

• Reliable/Available


VLDB benchmark (RWS)


Endpoint benchmark (RW)


1.2

• Concurrent schema changes

• Virtual nodes

• “Fat node” support

• JBOD improvements

• Off-heap bloom filters, compression metadata

• Parallel leveled compaction

• Atomic batches

• CQL3

• Collections

• Data dictionary

• Tracing

Concurrent Schema Changes

[Figure: two clients issue schema changes against the same Cassandra cluster at the same time]

Client: CREATE TABLE X; ... DROP TABLE X;

Client: CREATE TABLE Y; ... DROP TABLE Y;

Virtual nodes

[Figure, three frames: a ring without vnodes, split into six token ranges (A to F), next to a ring with vnodes, split into sixteen smaller ranges (A to P)]

Node Rebuild without vnodes

[Figure: ring without vnodes (ranges A to F), with the ranges stored on each of Node 1 through Node 6]

Node Rebuild with vnodes

[Figure: ring with vnodes (ranges A to P), with the ranges stored on each of Node 1 through Node 6]
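The two rebuild slides contrast how many peers can help restore a lost node. The following is a minimal Python sketch of that idea, not Cassandra's actual placement code. The token space size, the replication factor of 3, the "next distinct nodes clockwise" replica rule, and the choice of 256 tokens per node are all assumptions made for illustration.

```python
import random

RING_SIZE = 2**32  # illustrative token space, not a real partitioner's range


def build_ring(num_nodes, tokens_per_node, seed=1):
    """Return a sorted list of (token, node) pairs."""
    if tokens_per_node == 1:
        # Classic layout: one evenly spaced token per node.
        return [(i * RING_SIZE // num_nodes, i) for i in range(num_nodes)]
    rng = random.Random(seed)
    tokens = rng.sample(range(RING_SIZE), num_nodes * tokens_per_node)
    # Deal the random tokens out to the nodes round-robin.
    return sorted((t, i % num_nodes) for i, t in enumerate(tokens))


def replicas_for(ring, index, rf):
    """Walk clockwise from ring[index], collecting the first rf distinct nodes."""
    nodes = []
    i = index
    while len(nodes) < rf:
        node = ring[i % len(ring)][1]
        if node not in nodes:
            nodes.append(node)
        i += 1
    return nodes


def rebuild_sources(ring, dead_node, rf=3):
    """Peers that share at least one range with dead_node and could stream to it."""
    sources = set()
    for index in range(len(ring)):
        replicas = replicas_for(ring, index, rf)
        if dead_node in replicas:
            sources.update(n for n in replicas if n != dead_node)
    return sources


print(sorted(rebuild_sources(build_ring(6, 1), dead_node=0)))
print(sorted(rebuild_sources(build_ring(6, 256), dead_node=0)))
```

In this model, a six-node ring with one token per node leaves only the four ring neighbours of node 0 able to stream to it. With many small ranges per node, the surviving peers all tend to hold some of its data, so the rebuild load is spread more widely.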

JBOD support

[Figure, two frames: one Cassandra instance writing to four disks, HDD1 through HDD4; in the second frame HDD4 is marked failed (X)]

On-Heap/Off-Heap

[Figure: memory layout of the Java process running the JVM]

• On-heap (Java heap): managed by GC

• Off-heap (native memory): not managed by GC

Moving O(n) structures off-heap

• Row (partition) bloom filter
    • 1-2GB per billion rows

• Compression metadata
    • ~20GB per TB compressed data

• 1.2 targets 5-10TB of data per machine
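As a rough sanity check on the bloom filter figure, the standard sizing formula for an optimally configured Bloom filter is -ln(p) / (ln 2)^2 bits per key for a false-positive rate p. The sketch below applies it to one billion keys. The two false-positive rates are illustrative assumptions, not Cassandra's configured defaults.

```python
import math


def bloom_bits_per_key(false_positive_rate):
    """Optimal bits per key for a standard Bloom filter: -ln(p) / (ln 2)^2."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def bloom_size_gb(num_keys, false_positive_rate):
    """Filter size in decimal gigabytes for num_keys entries."""
    bits = num_keys * bloom_bits_per_key(false_positive_rate)
    return bits / 8 / 1e9


# Assumed false-positive rates, chosen only for illustration.
for p in (0.01, 0.001):
    print(f"p={p}: {bloom_size_gb(1_000_000_000, p):.2f} GB per billion rows")
```

Under these assumptions the result is roughly 1.2 GB and 1.8 GB per billion rows, which falls inside the 1-2GB range quoted on the slide.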


Batches

[Figure, five frames: a client sends a batch to a coordinator node, which forwards the writes to three partition replicas; the final frame marks a failure (X) partway through the batch]

Atomic batches

[Figure, six frames: a client, a coordinator node, three partition replicas, and an additional batchlog node; the last two frames mark a failure (X)]
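The batch frames can be read as: without a batchlog, a coordinator failure partway through leaves the batch partially applied; with a batchlog, another node holds a copy of the batch and can finish it. The following Python sketch models that reading in a single process. All class and function names here are hypothetical, and the real protocol (timeouts, replay scheduling, multiple batchlog nodes) is not represented.

```python
class Replica:
    """A partition replica that simply stores key/value writes."""

    def __init__(self):
        self.rows = {}

    def apply(self, key, value):
        self.rows[key] = value


class BatchlogNode:
    """Holds copies of in-flight batches so they can be replayed later."""

    def __init__(self):
        self.pending = {}

    def store(self, batch_id, mutations):
        self.pending[batch_id] = list(mutations)

    def remove(self, batch_id):
        self.pending.pop(batch_id, None)

    def replay(self, replicas):
        # Re-applying a write is harmless here because writes are overwrites.
        for batch_id, mutations in list(self.pending.items()):
            for replica_name, key, value in mutations:
                replicas[replica_name].apply(key, value)
            self.remove(batch_id)


class CoordinatorCrash(Exception):
    pass


def write_batch(batch_id, mutations, replicas, batchlog=None, crash_after=None):
    """Apply mutations in order; optionally crash after crash_after writes."""
    if batchlog is not None:
        batchlog.store(batch_id, mutations)  # logged before any replica write
    for count, (replica_name, key, value) in enumerate(mutations):
        if crash_after is not None and count == crash_after:
            raise CoordinatorCrash(batch_id)
        replicas[replica_name].apply(key, value)
    if batchlog is not None:
        batchlog.remove(batch_id)  # batch fully applied, log entry not needed


replicas = {name: Replica() for name in ("r1", "r2", "r3")}
batch = [("r1", "k", 1), ("r2", "k", 1), ("r3", "k", 1)]
log = BatchlogNode()
try:
    write_batch("b1", batch, replicas, batchlog=log, crash_after=1)
except CoordinatorCrash:
    log.replay(replicas)  # the batchlog node finishes the job
print({name: r.rows for name, r in replicas.items()})
```

Calling `write_batch` with `batchlog=None` and the same `crash_after` leaves only `r1` updated, which is the partial-batch outcome the earlier "Batches" frames illustrate.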

CREATE TABLE users (
    id uuid PRIMARY KEY,
    name text,
    state text,
    birth_date int
);

CREATE INDEX ON users(state);

SELECT * FROM users
WHERE state = 'Texas' AND birth_date > 1950;

CQL: You got SQL in my NoSQL!


Strictly “realtime” focused

• No joins

• No subqueries

• No aggregation functions* or GROUP BY

• Strictly limited ORDER BY


songs

a3e64f8f...   title: La Grange             artist: ZZ Top           album: Tres Hombres
8a172618...   title: Moving in Stereo      artist: Fu Manchu        album: We Must Obey
2b09185b...   title: Outside Woman Blues   artist: Back Door Slam   album: Roll Away

create column family songs
with key_validation_class = UUIDType
and comparator = UTF8Type -- cell names are strings
and column_metadata = [
    {column_name: title, validation_class: UTF8Type},
    {column_name: album, validation_class: UTF8Type},
    {column_name: artist, validation_class: UTF8Type},
    {column_name: data, validation_class: BytesType}];

id            title                 artist           album
a3e64f8f...   La Grange             ZZ Top           Tres Hombres
8a172618...   Moving in Stereo      Fu Manchu        We Must Obey
2b09185b...   Outside Woman Blues   Back Door Slam   Roll Away

CREATE TABLE songs (
    id uuid PRIMARY KEY,
    title text,
    artist text,
    album text,
    data blob
);

song_tags

a3e64f8f...   blues:    1973:
8a172618...   covers:   2003:

create column family song_tags
with key_validation_class = UUIDType
and comparator = UTF8Type;

id            tag_name
a3e64f8f...   blues
a3e64f8f...   1973
8a172618...   covers
8a172618...   2003

CREATE TABLE song_tags (
    id uuid,
    tag_name text,
    PRIMARY KEY (id, tag_name)
);
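The relationship between the CQL rows and the underlying storage row can be sketched in a few lines of Python: every CQL row with the same `id` lands in one storage row, and the `tag_name` value becomes a cell name with an empty value. This is only an illustrative model of the mapping the slides describe; the function name is hypothetical.

```python
from collections import defaultdict


def to_storage_rows(cql_rows):
    """Group CQL rows (id, tag_name) into one storage row per id.

    Each tag_name becomes a cell name with an empty value, like the
    'blues:' and '1973:' cells in the slide.
    """
    storage = defaultdict(dict)
    for row_id, tag_name in cql_rows:
        storage[row_id][tag_name] = b""
    # Cells within a storage row are ordered by the comparator; plain
    # string ordering stands in for UTF8Type here.
    return {row_id: dict(sorted(cells.items())) for row_id, cells in storage.items()}


cql_rows = [
    ("a3e64f8f...", "blues"),
    ("a3e64f8f...", "1973"),
    ("8a172618...", "covers"),
    ("8a172618...", "2003"),
]

for row_id, cells in to_storage_rows(cql_rows).items():
    print(row_id, list(cells))
```

Four CQL rows collapse into two storage rows. With string ordering, `1973` sorts ahead of `blues` inside its row, so the printed cell order differs from the order drawn on the slide.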


playlists

62c36092...   La Grange,ZZ Top,Tres Hombres: a3e64f8f...
              Moving in S...,Fu Manchu,We Must O...: 8a172618...
              Outside Wo...,Back Door ...,Roll Away: 2b09185b...

create column family playlists
with key_validation_class = UUIDType
and comparator = 'CompositeType(UTF8Type, UTF8Type, UTF8Type)'
and default_validation_class = UUIDType;

id            title              artist           album          song_id
62c36092...   La Grange          ZZ Top           Tres Hombres   a3e64f8f...
62c36092...   Moving in Stereo   Fu Manchu        We Must Obey   8a172618...
62c36092...   Outside Wo...      Back Door Slam   Roll Away      2b09185b...

CREATE TABLE playlists (
    id uuid,
    title text,
    album text,
    artist text,
    song_id uuid,
    PRIMARY KEY (id, title, album, artist)
);

Collections

id            title                 artist           album          tags
a3e64f8f...   La Grange             ZZ Top           Tres Hombres   {blues, 1973}
8a172618...   Moving in Stereo      Fu Manchu        We Must Obey   {covers, 2003}
2b09185b...   Outside Woman Blues   Back Door Slam   Roll Away

CREATE TABLE songs (
    id uuid PRIMARY KEY,
    title text,
    artist text,
    album text,
    tags set<text>,
    data blob
);

Data dictionary

cqlsh:system> SELECT * FROM schema_keyspaces;

 keyspace_name | durable_writes | strategy_class | strategy_options
---------------+----------------+----------------+----------------------------
     keyspace1 |           True | SimpleStrategy | {"replication_factor":"1"}
        system |           True |  LocalStrategy |                         {}
 system_traces |           True | SimpleStrategy | {"replication_factor":"1"}

cqlsh:system> SELECT * FROM schema_columnfamilies WHERE keyspace_name='keyspace1' AND columnfamily_name='test';

cqlsh:system> SELECT * FROM schema_columns WHERE keyspace_name='keyspace1' AND columnfamily_name='test';

cqlsh:system> SELECT * FROM local;

 key   | bootstrapped | cluster_name | cql_version | data_center | gossip_generation | partitioner                                 | rack  | release_version      | ring_id                              | thrift_version | tokens | truncated_at
-------+--------------+--------------+-------------+-------------+-------------------+---------------------------------------------+-------+----------------------+--------------------------------------+----------------+--------+--------------
 local | COMPLETED    | test         | 3.0.0       | datacenter1 | 1352846064        | org.apache.cassandra.dht.Murmur3Partitioner | rack1 | 1.2.0-beta2-SNAPSHOT | 224c55d5-21b4-42b0-8969-afc0cc04e812 | 19.35.0        | {0}    | null

cqlsh:system> SELECT * FROM peers LIMIT 1;

 peer      | data_center | rack  | release_version      | ring_id                              | rpc_address | schema_version                       | tokens
-----------+-------------+-------+----------------------+--------------------------------------+-------------+--------------------------------------+-----------------------
 127.0.0.3 | datacenter1 | rack1 | 1.2.0-beta2-SNAPSHOT | f6782327-ef8e-41cf-87b9-2edc287b1ffe | 127.0.0.3   | 915ed888-ddd0-3448-860c-582f4eea1bc6 | {6148914691236517204}

Request tracing

cqlsh:foo> INSERT INTO bar (i, j) VALUES (6, 2);
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

 activity                            | timestamp    | source    | source_elapsed
-------------------------------------+--------------+-----------+----------------
 Determining replicas for mutation   | 00:02:37,015 | 127.0.0.1 |            540
 Sending message to /127.0.0.2       | 00:02:37,015 | 127.0.0.1 |            779
 Message received from /127.0.0.1    | 00:02:37,016 | 127.0.0.2 |             63
 Applying mutation                   | 00:02:37,016 | 127.0.0.2 |            220
 Acquiring switchLock                | 00:02:37,016 | 127.0.0.2 |            250
 Appending to commitlog              | 00:02:37,016 | 127.0.0.2 |            277
 Adding to memtable                  | 00:02:37,016 | 127.0.0.2 |            378
 Enqueuing response to /127.0.0.1    | 00:02:37,016 | 127.0.0.2 |            710
 Sending message to /127.0.0.1       | 00:02:37,016 | 127.0.0.2 |            888
 Message received from /127.0.0.2    | 00:02:37,017 | 127.0.0.1 |           2334
 Processing response from /127.0.0.2 | 00:02:37,017 | 127.0.0.1 |           2550

Tracing an antipattern

CREATE TABLE queues (
    id text,
    created_at timeuuid,
    value blob,
    PRIMARY KEY (id, created_at)
);

id        created_at   value
myqueue   3092e86f     9b0450d30de9
myqueue   0867f47c     fc7aee5f6a66
myqueue   5fc74be0     668fdb3a2196

cqlsh:foo> SELECT * FROM queues WHERE id = 'myqueue' ORDER BY created_at LIMIT 1;
Tracing session: 4ad36250-1eb4-11e2-0000-fe8ebeead9f9

 activity                                 | timestamp    | source    | source_elapsed
------------------------------------------+--------------+-----------+----------------
 execute_cql3_query                       | 19:31:05,650 | 127.0.0.1 |              0
 Sending message to /127.0.0.3            | 19:31:05,651 | 127.0.0.1 |            541
 Message received from /127.0.0.1         | 19:31:05,651 | 127.0.0.3 |             39
 Executing single-partition query         | 19:31:05,652 | 127.0.0.3 |            943
 Acquiring sstable references             | 19:31:05,652 | 127.0.0.3 |            973
 Merging memtable contents                | 19:31:05,652 | 127.0.0.3 |           1020
 Merging data from memtables and sstables | 19:31:05,652 | 127.0.0.3 |           1081
 Read 1 live cells and 100000 tombstoned  | 19:31:05,686 | 127.0.0.3 |          35072
 Enqueuing response to /127.0.0.1         | 19:31:05,687 | 127.0.0.3 |          35220
 Sending message to /127.0.0.1            | 19:31:05,687 | 127.0.0.3 |          35314
 Message received from /127.0.0.3         | 19:31:05,687 | 127.0.0.1 |          36908
 Processing response from /127.0.0.3      | 19:31:05,688 | 127.0.0.1 |          37650
 Request complete                         | 19:31:05,688 | 127.0.0.1 |          38047
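The "Read 1 live cells and 100000 tombstoned" line is the heart of the antipattern: consumed queue entries remain as tombstones at the head of the partition, and each read of the next item has to step over all of them. The Python sketch below is a toy model of that effect. The class is hypothetical, and it uses 1,000 entries rather than the trace's 100,000 to keep the run short.

```python
TOMBSTONE = object()  # marker left behind by a delete


class QueuePartition:
    """One partition whose cells are kept in created_at order."""

    def __init__(self):
        self.cells = []  # list of (created_at, value)

    def enqueue(self, created_at, value):
        # Assumes created_at values arrive in increasing order.
        self.cells.append((created_at, value))

    def read_head(self):
        """Return (index, value, tombstones_scanned) for the first live cell."""
        scanned = 0
        for index, (_, value) in enumerate(self.cells):
            if value is TOMBSTONE:
                scanned += 1
            else:
                return index, value, scanned
        return None, None, scanned

    def dequeue(self):
        """Read the head, then delete it by replacing it with a tombstone."""
        index, value, _ = self.read_head()
        if index is not None:
            self.cells[index] = (self.cells[index][0], TOMBSTONE)
        return value


queue = QueuePartition()
for i in range(1001):
    queue.enqueue(i, f"job-{i}")
for _ in range(1000):
    queue.dequeue()

_, value, scanned = queue.read_head()
print(f"read {value!r} after scanning {scanned} tombstones")
```

In this model the cost of reading one live item grows with the number of items already consumed, which matches the jump in `source_elapsed` around the tombstone line in the trace.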


2.0

• Eager retries

• Improved compaction

• Triggers

• CAS (Compare-and-set)

• More-efficient repair

Eager retries

[Figure, three frames: a client sends a request to a coordinator, which can read from three replicas that are 40%, 90%, and 30% busy]
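One way to read the eager-retries frames: if the replica the coordinator asked first is slow to answer, the coordinator sends the same request to another replica and uses whichever response arrives first. The Python sketch below models only the resulting latency. The millisecond figures attached to each replica and the retry threshold are invented for illustration and do not come from the slides.

```python
def read_latency(replica_latencies_ms, order, retry_after_ms=None):
    """Latency seen by the coordinator for one read.

    order: replicas in the order the coordinator would try them.
    retry_after_ms: if set, the request is re-sent to the next replica
    once the first has been silent this long; the faster answer wins.
    """
    first = replica_latencies_ms[order[0]]
    if retry_after_ms is None or len(order) < 2 or first <= retry_after_ms:
        return first
    backup = retry_after_ms + replica_latencies_ms[order[1]]
    return min(first, backup)


# Hypothetical response times for the three replicas in the figure.
latencies = {"40% busy": 15, "90% busy": 300, "30% busy": 10}
order = ["90% busy", "30% busy", "40% busy"]

print(read_latency(latencies, order))                     # no retry
print(read_latency(latencies, order, retry_after_ms=50))  # retry after 50 ms
```

With these made-up numbers, an unlucky first choice costs 300 ms without a retry and 60 ms with one, at the price of a duplicate request.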

Improved compaction

• Specialized strategy for append-only with TTL

• Can we do any better for a general-purpose workload?


Triggers

CREATE TRIGGER foo
BEFORE UPDATE
ON users
EXECUTE '/var/lib/cassandra/triggers/send_registration_email.jar'

class MyTrigger implements ITrigger
{
    public Collection<RowMutation> revise(ByteBuffer key, ColumnFamily update)
    {
        ...
    }
}

CAS

Session 1

SELECT * FROM users
WHERE username = 'jbellis'

[empty resultset]

INSERT INTO users (...)
VALUES ('jbellis', ...)

Session 2

SELECT * FROM users
WHERE username = 'jbellis'

[empty resultset]

INSERT INTO users (...)
VALUES ('jbellis', ...)


CAS

• Locking does not solve this problem

• 2PC does not solve this problem

• Locking + 2PC does not solve this problem


Paxos!


Open questions

UPDATE USERS
SET email = 'jonathan@datastax.com', ...
WHERE username = 'jbellis'
IF email = 'jbellis@datastax.com'

• What do we call it?
    • Conditional write guarantee?
    • Atomic conditional updates?
    • Lightweight transactions?

• What syntax do we use for CQL?
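The difference between the two-session race and a conditional write can be shown with a small in-memory model. This Python sketch is single-process, so the check and the write are atomic only because nothing can interleave inside a method. It says nothing about how the guarantee is achieved across replicas, which is where the slides point to Paxos. The class and method names, and the `example.com` addresses, are hypothetical.

```python
class Users:
    """Toy users table keyed by username."""

    def __init__(self):
        self.rows = {}

    def select(self, username):
        return self.rows.get(username)

    def insert(self, username, email):
        # Unconditional write: silently overwrites any existing row.
        self.rows[username] = email

    def insert_if_not_exists(self, username, email):
        # Check and write as one step; returns whether the write applied.
        if username in self.rows:
            return False
        self.rows[username] = email
        return True

    def update_if(self, username, expected_email, new_email):
        # Mirrors the proposed UPDATE ... IF email = ... statement.
        if self.rows.get(username) != expected_email:
            return False
        self.rows[username] = new_email
        return True


# Read-then-write race: both sessions see no row, then both insert.
racy = Users()
seen_by_1 = racy.select("jbellis")
seen_by_2 = racy.select("jbellis")
racy.insert("jbellis", "first@example.com")
racy.insert("jbellis", "second@example.com")  # session 1's row is lost
print(seen_by_1, seen_by_2, racy.select("jbellis"))

# Conditional writes: only one session can win.
safe = Users()
print(safe.insert_if_not_exists("jbellis", "jbellis@datastax.com"))
print(safe.insert_if_not_exists("jbellis", "second@example.com"))
print(safe.update_if("jbellis", "jbellis@datastax.com", "jonathan@datastax.com"))
```

In the racy version both sessions believe they created the account. In the conditional version the second insert is rejected and the caller is told so.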


More-efficient repair


Consequences

• Repair won’t replace missing data due to hardware failure by default

• Add --include-previously-repaired to force old-style full validation
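A plausible reading of these consequences is that the more-efficient repair skips data already marked as repaired, so anything lost from that set later is not noticed unless a full validation is forced. The Python sketch below models that selection logic only. The `SSTable` class, the `repaired` flag, and the function names are assumptions for illustration; only the `--include-previously-repaired` option name comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    name: str
    repaired: bool = False


def sstables_to_validate(sstables, include_previously_repaired=False):
    """Pick the sstables a repair run would compare across replicas."""
    if include_previously_repaired:
        return list(sstables)  # old-style full validation
    return [s for s in sstables if not s.repaired]


def run_repair(sstables, include_previously_repaired=False):
    """Validate the selected sstables and mark them as repaired."""
    selected = sstables_to_validate(sstables, include_previously_repaired)
    for sstable in selected:
        sstable.repaired = True
    return [s.name for s in selected]


tables = [SSTable("a", repaired=True), SSTable("b"), SSTable("c")]
print(run_repair(tables))                                    # only new data
print(run_repair(tables))                                    # nothing left to do
print(run_repair(tables, include_previously_repaired=True))  # everything
```

The second run does no work because everything is already marked repaired, which is the efficiency gain. It is also why data lost from an already-repaired sstable would go unnoticed without the full-validation option.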
