
Page 1: Deep Dive into Cassandra

Deep Dive into Apache Cassandra

Big Data Madison - February 2015 @brenttheisen

Page 2: Deep Dive into Cassandra

Who I Am, Where I Work

• Brent Theisen, Principal Engineer at Womply.

• Womply uses data to grow, protect and simplify small businesses.

• Obtain transaction, weather and third party site data relevant to merchants.

• Provide brick and mortar merchants with products and analytics based on this data.

• Using Cassandra since Aug 2013.

Page 3: Deep Dive into Cassandra

Cassandra History

• Originally developed at Facebook around 2008.

• Name comes from Greek mythological prophet cursed with accurate predictions no one believed.

‣ Move to NoSQL inevitable but few realize it.

‣ Cassandra was a cursed Oracle.

• Modeled after the Amazon Dynamo and Google BigTable papers.

• Open sourced on Google Code in 2008.

• Top level Apache project in 2010.

• Today it’s used by hundreds of companies: Netflix, Twitter, eBay, Call of Duty.

Page 4: Deep Dive into Cassandra

Why Use Cassandra?

Cassandra might be the right tool for the job if you need to:

• Store terabytes of data.

• Perform a high number of writes/second.

• Perform key based queries in “real time”.

• Run massively parallel processing jobs.

• Replicate data across multiple data centers in full duplex.

• Scale horizontally in a predictable way.

Page 5: Deep Dive into Cassandra

What Can It Run On?

• Written in Java. Targets Oracle Java 1.7.

• Can run on Windows, Linux, OS X and just about any UNIXish OS.

• Client drivers available for many languages:

‣ Java, C++, Python, Node.js, Ruby, C#, etc.

‣ Not all languages have clients that support all new features.

• Support for MPP frameworks:

‣ DataStax Spark Connector.

‣ Hadoop input/output format.

Page 6: Deep Dive into Cassandra

Installing Cassandra

• Distributions of Cassandra:

‣ Binaries: DataStax Community (DSC) and Enterprise (DSE).

‣ Apache source distribution.

‣ Try to stick to whatever the latest stable version is.

‣ Current most stable is 2.0.12, latest is 2.1.3.

• Support available for Docker, BOSH, Chef, Puppet, etc.

• Development environments should be provisioned as close to production as possible. Docker/Vagrant are your friends.

Page 7: Deep Dive into Cassandra

Configuring Cassandra

• Almost all config options are set in cassandra.yaml.

• Some of the more important ones:

‣ seeds: List of IPs that new nodes should use as Gossip contact points when joining the cluster.

‣ endpoint_snitch: Java class that informs Cassandra about network topology to efficiently route Gossip P2P requests. Some options: RackInferringSnitch, PropertyFileSnitch, EC2Snitch.

‣ initial_token: In a cluster with one token range per node, specifies the starting token of the range owned by the node.

‣ num_tokens: In a virtual node cluster, specifies the number of tokens randomly assigned to the node.

Page 8: Deep Dive into Cassandra

Types of Key Rings

• Node-per-range requires manually recalculating initial_token for all nodes when adding/removing nodes.

• Vnodes save you from this.

• If you have a mix of hardware/instance types, you can set num_tokens accordingly.

Page 9: Deep Dive into Cassandra

Cassandra CLI Tools

• cqlsh: CQL client for running queries against a node.

• nodetool: Provides a number of subcommands useful for administration/monitoring. Really just a JMX client.

Output from one of the most basic commands, nodetool status:

brent@cassandra1:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  172.17.0.33  76.09 KB  256     68.5%             9daf06ab-85d6-42b2-8190-49edd3d987a4  rack1
UN  172.17.0.36  76.14 KB  256     65.0%             18f008ff-0057-46c1-a71f-cc204a018808  rack1
UN  172.17.0.9   62.43 KB  256     66.5%             50c90d36-9c0a-43f0-8dd3-d66c24505ecb  rack1

Page 10: Deep Dive into Cassandra

Cassandra Data Model

• Keyspaces (aka “schemas”) contain tables.

‣ Specify a per data center replication factor.

• Data is stored in tables (aka “column families”).

• Tables must have a primary key:

‣ Natural keys preferred but random UUIDs can also be used.

‣ Can use several columns to form a compound key.

‣ How you structure your primary key determines partitioning.

• That is all that’s required; non-primary key columns are optional.
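To make the compound-key bullet concrete, here is a minimal sketch (a hypothetical table, not one from the deck; the full worked example is the events table later on). The first parenthesized component becomes the partition key and the remaining columns become clustering columns that order rows within each partition:

CREATE TABLE logins_by_user (
  username text,
  login_time timestamp,
  ip inet,
  PRIMARY KEY ((username), login_time)
);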

Page 11: Deep Dive into Cassandra

Creating a Keyspace

CREATE KEYSPACE my_keyspace WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'datacenter1': 3,
  'datacenter2': 2
};

• Specifies a name for the keyspace and its replication settings.

• Two types of replication strategies:

‣ SimpleStrategy: Single data center only. Not for production.

‣ NetworkTopologyStrategy: Allows for multi-datacenter clusters.

• Replication factor specifies how many nodes a key should be stored on.

Page 12: Deep Dive into Cassandra

Creating a Table

USE my_keyspace;

CREATE TABLE users (
  username text PRIMARY KEY,
  email text,
  first_name text,
  last_name text
);

• USE just changes the current schema we are working in.

• CQL CREATE TABLE syntax pretty similar to SQL.

• The username PK defines how the data is partitioned and indexed.

Page 13: Deep Dive into Cassandra

Scalar Data Types

• All the usual suspects: boolean, text, int, bigint (long), float, double, blob

• counter: Counter column (64-bit signed value). Only increment and decrement.

• varint: Variable precision integer.

• decimal: Variable precision decimal.

• inet: An IPv4 or IPv6 address.

• timestamp: Date and time including timezone. Time gets converted to UTC.

• timeuuid: Type 1 UUID: timestamp + clock sequence (prevents duplicates) + MAC address. Can use now().

• uuid: Type 1 or type 4 UUID. Have to generate your own.
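As a small illustration of the counter type (a hypothetical table, not one from the deck), counters live in their own table alongside the primary key and can only be incremented or decremented:

CREATE TABLE page_view_counts (
  product_id text PRIMARY KEY,
  views counter
);

UPDATE page_view_counts SET views = views + 1 WHERE product_id = 'mega-widget';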

Page 14: Deep Dive into Cassandra

Collection Data Types

• list: Stores elements in whatever order you’d like.
‣ Control order of values by adding and deleting values by index.
‣ Individual values can also be removed.
‣ Must specify a sub type in your CREATE or ALTER TABLE:

ALTER TABLE users ADD login_times list<timestamp>;

• set: Stores unique elements in natural sort order.
‣ Other than uniqueness and sort order, similar to list.

ALTER TABLE users ADD login_times set<timestamp>;

• map: Stores key/value pairs.
‣ Set and delete key/value pairs by key.
‣ Must specify types for both the key and value in your CREATE or ALTER TABLE:

ALTER TABLE users ADD login_times map<timestamp, text>;
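Once a collection column exists, individual elements are written and removed with UPDATE. A hedged sketch, assuming login_times was created as a list, a set, or a map respectively (only one of the ALTERs above can actually be applied to the same column name):

UPDATE users SET login_times = login_times + ['2015-02-24 08:05:01-0600'] WHERE username = 'alice';  -- append to a list
UPDATE users SET login_times = login_times + {'2015-02-24 08:05:01-0600'} WHERE username = 'alice';  -- add to a set
UPDATE users SET login_times['2015-02-24 08:05:01-0600'] = 'web' WHERE username = 'alice';           -- set a map entry by key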

Page 15: Deep Dive into Cassandra

User Defined Types

• New in Cassandra 2.1.
• Can be used to store denormalized data that might have otherwise been stored in another table.

CREATE TYPE address (
  street text,
  city text,
  zip_code int,
  phones set<text>
);

ALTER TABLE users ADD addresses map<text, frozen <address>>;

• Serialized as a BLOB, so all components of the type must be passed on write.
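So a write that touches the UDT supplies every field of the value; a minimal sketch (the 'home' key and the address values are made up for illustration):

UPDATE users SET addresses['home'] = {
  street: '123 Main St',
  city: 'Madison',
  zip_code: 53703,
  phones: {'608-555-0100'}
} WHERE username = 'alice';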

Page 16: Deep Dive into Cassandra

Inserting Data

How many nodes should receive the write before the query returns?

CONSISTENCY ONE;

INSERT INTO users (username, email, first_name, last_name)
VALUES ('alice', '[email protected]', 'Alice', 'Smith');

Page 17: Deep Dive into Cassandra

Consistency Levels

Specifies how many nodes need to successfully process a read or write for the query to succeed. Some of the more prevalent consistency levels:

• ONE, TWO, THREE: Exact number of nodes that need to reply successfully.

• LOCAL_ONE: Must succeed on at least one node in the local DC.

• QUORUM: A quorum of nodes must reply: (replication factor / 2) + 1.

• LOCAL_QUORUM: Same as QUORUM but only counts nodes in the local data center.

• ALL: Query fails if any one of the replicas does not reply.

• ANY: Only applies to writes. Guarantees that the write succeeds even if all replicas are down. The coordinator node may persist locally and replay when a replica is available (AKA “hinted handoff”).
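In cqlsh, the session consistency level is set with the CONSISTENCY command before running statements; for example, to require a local-datacenter quorum for subsequent reads and writes:

cqlsh:my_keyspace> CONSISTENCY LOCAL_QUORUM;
cqlsh:my_keyspace> SELECT * FROM users WHERE username = 'alice';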

Page 18: Deep Dive into Cassandra

Insert/Update, Same Difference

UPDATE users SET email = '[email protected]', first_name = 'Bob' WHERE username = 'bob';

In Cassandra, inserts and updates are the same thing so they are often referred to as writes or “upserts”.
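To see the upsert behavior, an UPDATE against a key that does not yet exist simply creates the row, and re-running an INSERT for an existing key overwrites the columns it names (user 'carol' is hypothetical):

UPDATE users SET first_name = 'Carol' WHERE username = 'carol';         -- creates the row if 'carol' does not exist
INSERT INTO users (username, first_name) VALUES ('carol', 'Caroline');  -- overwrites first_name on that same row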

Page 19: Deep Dive into Cassandra

Anatomy of a Write

• Client sends query to a node which will act as a coordinator for the request.

• Coordinator determines which replicas in the ring own the key and sends the details of the write to all of them.

• Client’s query blocks until the coordinator has gotten enough successful responses back to satisfy the consistency level.

• If a replica is down it will miss the write, but Cassandra has mechanisms to ensure those replicas become consistent: hinted handoff, read repair and nodetool repair.

Page 20: Deep Dive into Cassandra

How Replicas Persist Data

• Replica receives a write request from a coordinator.

• Writes an entry in its commit log.

• Adds an entry to the Memtable.

• Once a Memtable has exceeded a threshold it is flushed to disk in a file called an SSTable.

Page 21: Deep Dive into Cassandra

SSTables

• SSTables = Sorted String Tables

• Only contain data for a single Cassandra table.

• Sorted by row key and column key.

• Immutable, once written they never change.

• By default, contents are compressed using Snappy.

• A replica will often have several SSTable files for a given CQL table.

Page 22: Deep Dive into Cassandra

How Our Data Looks in an SSTable File

alice  email      = [email protected]
       first_name = Alice
       last_name  = Smith

bob    email      = [email protected]
       first_name = Bob

• Partition key identifies the row.

• Column key is the name of the column.

• All sorted on row key and column key.

• We didn’t specify a value for Bob’s last_name so it simply isn’t there.

Page 23: Deep Dive into Cassandra

Querying for Users by Key

cqlsh:my_keyspace> SELECT * FROM users WHERE username = 'alice';

 username | email             | first_name | last_name
----------+-------------------+------------+-----------
    alice | [email protected] | Alice      | Smith

(1 rows)

Page 24: Deep Dive into Cassandra

Anatomy of a Read

• Replica gets read request.

• Bloom filter identifies which SSTables might contain data for the row key.

• For each SSTable file:
‣ Gets SSTable offset for row key from either partition key cache or partition index.
‣ Reads relevant portion of SSTable file sequentially.

• Sends merged results to coordinator.

Page 25: Deep Dive into Cassandra

Querying for Users by Email

Can’t query against a non-primary key column because that is all that is indexed:

SELECT * FROM users WHERE email = '[email protected]';

code=2200 [Invalid query] message="No indexed columns present in by-columns clause with Equal operator"

Page 26: Deep Dive into Cassandra

Secondary Indexes

cqlsh:my_keyspace> CREATE INDEX ON users (email);
cqlsh:my_keyspace> SELECT * FROM users WHERE email = '[email protected]';

 username | email             | first_name | last_name
----------+-------------------+------------+-----------
    alice | [email protected] | Alice      | Smith

(1 rows)

Page 27: Deep Dive into Cassandra

Deleting Data

cqlsh:my_keyspace> SELECT COUNT(*) FROM users WHERE username = 'bob';

 count
-------
     1

cqlsh:my_keyspace> DELETE FROM users WHERE username = 'bob';
cqlsh:my_keyspace> SELECT COUNT(*) FROM users WHERE username = 'bob';

 count
-------
     0

Page 28: Deep Dive into Cassandra

Tombstones

• Deletes don’t actually delete data immediately.

• They write a “tombstone” to the replicas that marks those columns as having been deleted.

• When performing a read, tombstones will be found via the normal read process and a null value returned to the client.

• The data actually gets “deleted” when SSTable files with a tombstone get compacted with other SSTable files containing data for that column.

Page 29: Deep Dive into Cassandra

Compaction

• Compaction is the process of merging SSTables into one and deleting the old ones.
• Two types of compaction: major and minor.
• Major compactions:
‣ Triggered by running nodetool compact.
‣ Compacts all SSTable files on a node into one big SSTable file.
‣ Should be avoided as it makes it unlikely minor compactions will occur.
• Minor compactions:
‣ Happen automatically in the background.
‣ How minor compactions work depends on the compaction strategy and its settings specified in CREATE TABLE (see the sketch below).
‣ Size Tiered: SSTables get compacted in tiers based on their size.
‣ Date Tiered: SSTables get compacted in tiers based on the time window they cover.
‣ Leveled: Ensures data for a row key is not overlapped in SSTables. Good for reads.
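A sketch of changing the strategy on an existing table (the strategy and option values here are illustrative, not recommendations):

ALTER TABLE users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160
};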

Page 30: Deep Dive into Cassandra

An Example Use Case

• We are running an e-commerce site that sells things.

• Need to be able to record interaction with our site for future analysis.

• Browser side code will send us page view, hover, click and other events.

• Our job is to model the Cassandra tables and queries that the server-side component uses to persist events.

Page 31: Deep Dive into Cassandra

An Event Table

CREATE TABLE events (
  username text,
  time timestamp,
  type text,
  params map<text, text>,
  PRIMARY KEY ((username), time)
);

• The username column acts as the “partition key”.

• The time column acts as the “clustering key”. Can perform range queries against it.

• The params column is a map that will contain event specific properties.

Page 32: Deep Dive into Cassandra

Event Table Write Examples

Alice goes to the homepage:

INSERT INTO events (username, time, type, params)
VALUES ('alice', '2015-02-24 08:05:01-0600', 'page_load', { 'url': 'http://example.com/' });

Alice hovers some product links:

INSERT INTO events (username, time, type, params)
VALUES ('alice', '2015-02-24 08:05:03-0600', 'hover', { 'product_id': 'regular-widget' });

INSERT INTO events (username, time, type, params)
VALUES ('alice', '2015-02-24 08:05:15-0600', 'hover', { 'product_id': 'mega-widget' });

Alice goes to the Super Mega Widget product page:

INSERT INTO events (username, time, type, params)
VALUES ('alice', '2015-02-24 08:05:24-0600', 'page_load', { 'url': 'http://example.com/super-mega-widget', 'product_id': 'super-mega-widget' });

Page 33: Deep Dive into Cassandra

Querying Events

• Count all the events for Alice:

cqlsh:my_keyspace> SELECT COUNT(*) FROM events WHERE username = 'alice';

 count
-------
     4

• Find all events for Alice within a time range:

cqlsh:my_keyspace> SELECT * FROM events WHERE username = 'alice'
                   AND time >= '2015-02-24 08:05:03-0600'
                   AND time <= '2015-02-24 08:05:21-0600';

 username | time                     | params                           | type
----------+--------------------------+----------------------------------+-------
    alice | 2015-02-24 14:05:03+0000 | {'product_id': 'regular-widget'} | hover
    alice | 2015-02-24 14:05:15+0000 | {'product_id': 'mega-widget'}    | hover

Page 34: Deep Dive into Cassandra

What CQL Does Not Do

• Joins

‣ Doing a join on tables partitioned across many nodes is too expensive.

‣ Instead, you should attempt to denormalize your data model (see the sketch after this list).

• Subqueries

‣ Cassandra sticks to one thing: storing and retrieving key partitioned data at scale.

‣ Things you might use a subquery for are usually solved by denormalizing and/or having an app specific data layer do the heavy lifting.

• Group By

‣ Aggregate datasets are usually pre-computed and stored in their own tables.

‣ Probably want to use Spark Streaming/Storm/etc to get sliding windows.
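As a sketch of the denormalization approach (a hypothetical query table, not from the deck), data is duplicated into a second table keyed by the column you need to look up by, and the application writes to both:

CREATE TABLE users_by_email (
  email text PRIMARY KEY,
  username text,
  first_name text,
  last_name text
);

-- The application inserts into both users and users_by_email on every write,
-- then reads from whichever table matches the query's key.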

Page 35: Deep Dive into Cassandra

“Real Time” Aggregation

• Let’s say we need to graph all product page views in “real time”.

• Each data point could be a five minute aggregate, or whatever.

• What might the table look like that Spark Streaming/Storm/etc persist to?

CREATE TABLE event_time_series (
  date text,
  time timestamp,
  product_id_page_views map<text, int>,
  PRIMARY KEY ((date), time)
);

• The date partition key ensures data points get distributed across the cluster.

• The time clustering column allows time-based range queries within a day.
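A hedged sketch of how the streaming job might persist a single five-minute data point (the counts are illustrative):

INSERT INTO event_time_series (date, time, product_id_page_views)
VALUES ('2015-02-24', '2015-02-24 08:05:00-0600',
        { 'super-mega-widget': 42, 'mega-widget': 17 });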

Page 36: Deep Dive into Cassandra

Analytics Options

• Hadoop

‣ Painful to set up; best to use DataStax Enterprise.

‣ Map reduce, Hive and Pig are all supported to varying degrees.

‣ All but deprecated within much of the Cassandra community.

• Apache Spark

‣ Does not require DSE.

‣ Use the DataStax Spark Cassandra Connector.

‣ Highly recommend Spark for doing analytics with Cassandra.

Page 37: Deep Dive into Cassandra

Analytics Best Practices

• Ensure good data locality by running Hadoop/Spark on the same nodes that run Cassandra.

• Keep data centers doing OLAP separate from those doing OLTP.

• Spend some extra time ensuring Hadoop/Spark use system resources effectively without trampling each other.

‣ A resource manager like YARN or Mesos could help.

Page 38: Deep Dive into Cassandra

Administrivia

• Each node should run nodetool repair -pr on a regular basis to ensure decent consistency.

• Use NTP to ensure clocks on all nodes are accurate.

• Data directory should always be on local (preferably SSD) storage, never a SAN.

• Cassandra can do JBOD or you can use RAID 0. RAID levels above 0 are unneeded.

• Compaction requires there to be extra available storage capacity (50% [worst case] for tiered compaction, 10% for leveled compaction).

• Read and write ops/sec should scale linearly. Two nodes = 2x throughput, four nodes = 4x, etc.

Page 39: Deep Dive into Cassandra

Predictable Linear Scaling

Page 40: Deep Dive into Cassandra

More Info on Cassandra

• DataStax Developer: http://www.datastax.com/dev

• Planet Cassandra: http://planetcassandra.org/

• Beware of old documentation, a lot has changed.

‣ Stick to CQLv3 in particular.

‣ Avoid the Thrift API.