MAKING IT TO VETERAN CASSANDRA STATUS Been There, Done That, Survived

Upload: eric-lubow

Post on 16-Apr-2017

Eric Lubow @elubow

PERSONAL VANITY

๏ CTO of SimpleReach

๏ Co-Author of Practical Cassandra

๏ Skydiver, Mixed Martial Artist, Motorcyclist, Dog Dad (IG: @charliedognyc), NY Giants fan

SIMPLEREACH

๏ Identify the best content

๏ Use engagement metrics

๏ Stream processing ingest

๏ Many metrics, time sliced

๏ Multiple data stores

AM I QUALIFIED TO BE A VETERAN

๏ Started using Cassandra at 0.2 in Sep of 2009

๏ First put Cassandra in production at 1.0

๏ Helped in building multiple drivers

๏ Filed lots of Jira tickets (40+)

๏ Beta tested features

๏ Large counter deployment (largest?)

DID I MENTION I CO-WROTE A BOOK?

What are we actually going to talk about today?

HOW DOES ONE BECOME A VETERAN

It’s not all unicorns and rainbows

HOW DO I LEVEL UP?

๏ Use Cassandra

๏ Dig into the code from time to time (server and drivers)

๏ Know strengths and weaknesses and understand why

๏ Follow the changelogs and mailing lists

๏ Stress Cassandra in unconventional ways

๏ Learn the failure scenarios and how to fix them (hang out on IRC)

๏ Break the rules from time to time to see what happens

๏ “Those who cannot remember the past are condemned to repeat it.” - George Santayana

HOW DID SIMPLEREACH GET FROM …

CHOOSING A DATABASE IS EASY, #AMIRITE

๏ What’s the latest cool technology?

๏ What is my data volume?

๏ What are my query patterns?

๏ Is my data (un)structured?

๏ Will data remain consistent?

๏ Am I read heavy or write heavy?

๏ Am I batch loading data?

๏ Is eventually consistent data ok?

๏ Can I have a DR plan?

๏ Legal/compliance requirements?

๏ Are there experts/enterprise support?

๏ What’s the community like?

๏ Easy to administer?

๏ Tooling, monitoring, language support?

๏ Cloud or iron?

๏ High volume ingestion or batch loading?

๏ Fault tolerance?

๏ Open source vs enterprise system?

๏ Employee learning curve vs. learning cost?

LET’S LOOK AT SOME USE-CASES

USE-CASE: READ/WRITE PATTERNS

WRITE: High volume/High velocity ingestion

READ: Recency, key/value lookups, ETL

Cassandra

๏ Log structured storage; fast writes

๏ Writes do not affect reads

๏ Row creation unaffected by table size

๏ Indexing does not affect writes

๏ No locking, uses vector clock/LWW

๏ Goals

Mongo

๏ Document storage; slower writes

๏ MMAP reads affected by writes

๏ Slow document creation in large collections

๏ Poor indexing can destroy entire DB

๏ Server level, db level, collection level locks

๏ Goals
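The "no locking, LWW" point in the Cassandra column can be illustrated with a minimal sketch of last-write-wins reconciliation. This is not Cassandra's implementation: the `Cell` type and the `reconcile` function are hypothetical names, and the tie-break on the value is an assumption made only so that the merge stays deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One written value plus the timestamp attached at write time (hypothetical type)."""
    value: str
    timestamp_micros: int


def reconcile(a: Cell, b: Cell) -> Cell:
    """Pick a winner between two versions of the same cell without taking a lock.

    The version with the higher timestamp wins. On a timestamp tie this sketch
    falls back to comparing values so that every replica picks the same winner
    (an assumption for illustration).
    """
    if a.timestamp_micros != b.timestamp_micros:
        return a if a.timestamp_micros > b.timestamp_micros else b
    return a if a.value >= b.value else b


if __name__ == "__main__":
    older = Cell("draft", 1_000)
    newer = Cell("published", 2_000)
    # The merge is order-independent, so replicas can apply writes independently.
    print(reconcile(older, newer) == reconcile(newer, older))
```

Because the result does not depend on the order in which versions arrive, writers never need to coordinate. The cost is that a write carrying an older timestamp is silently discarded.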

HELPERS FOR A MORE AFFORDABLE CLUSTER

[Diagram components: Aggregator, Broadcast, Calculator, NSQ; writers for Mongo, Redis, Cassandra, Solr, and Vertica]

HOW DO WE KNOW WHAT WORKS BEST

USE-CASE: ADMINISTRATION

BASICALLY JUST ME

Cassandra

๏ Every node is the same base

๏ No master node

๏ All monitoring through JMX

๏ One step to add/remove nodes

๏ Tunables, lots of 'em

๏ Easily wrote our own Chef cookbook

๏ Goals

Mongo

๏ Config nodes, Shard nodes, Replica nodes

๏ Master/slave nodes, leader election

๏ Monitoring via mongostat sometimes

๏ Two steps to add/remove nodes

๏ No tunables

๏ Many poorly working Chef cookbooks

๏ Goals

CASSANDRA IS OPEN SOURCE

๏ Primarily DataStax

๏ Community contributions

๏ Who is the community?

SERIOUSLY 40+ JIRA TICKETS?

SPARK-6949 Pyspark and datetime

OPSC-6186 Rebalance - while calling decorator (IndexError): list index out of range

CASSANDRA-9871 Cannot replace token does not exist - DN node removed as Fat Client

OPSC-6045 Agent CPU on startup 800 Seconds

OPSC-5346 Opsc Repair service system_traces system_auth

CASSANDRA-7409 LCS improvement

CASSANDRA-8611 Socket timeout shitty default

CASSANDRA-9279 Gossip (and mutations) lock up on Startup

OPSC-4879 OpsC Agent JMX Connections and Cassandra Operations Fail Incessantly

CASSANDRA-8086 Too many connections - Cassandra Defense

CASSANDRA-7122 System peers

CASSANDRA-6506 Counters++ Final Performance

CASSANDRA-7510 Up node gossip messages -- affects drivers

PYTHON-202 More control for metadata updates

PYTHON-201 Optionally randomize contact points

OPSC-3672 OpsC - Repair Service Restarts on Node Flopping

DSP-3059 / SOLR-5463 Solr 4.10 - and Deep Paging

CASSANDRA-8548 Cleanup Dump

DSP-4560 Possible ticket Upgrade from 4.5.2 to 4.5.3

DSP-3341 In-memory Phase 2 (off heap and remove GB limit)

DSP-3970 Solr indexes even when values don't change

CASSANDRA-8150 Stump's JVM Tuning

SIMPLEREACH CONTEXT

๏ 100 million URLs

๏ 350 million Tweets

๏ 50k - 100k events per second (tens of billions of events per day)

๏ 225G of new data per hour

๏ 700T of total data (10T per month)

๏ 10T of hot data

๏ 72-node Cassandra cluster

๏ 52 Realtime Nodes

๏ 9 Search Nodes

๏ 11 Spark Nodes

[Diagram labels: Solr, Vertica + Cassandra, Vertica, Mongo]

CONQUERING COUNTERS

๏ Average over 200k counter writes per second

๏ Pre-aggregate writes (saved us 10x the writes)

๏ Trying to defeat the counter time bomb

๏ Breaking the rules with CASSANDRA-8150

๏ Many, many JVM tuning changes

๏ All things possible through monitoring

๏ Upgraded every node in the cluster by hand one at a time

๏ Upgrading to 2.1 definitely sealed the deal
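The idea behind pre-aggregating counter writes can be sketched as follows. This is an illustration rather than SimpleReach's code: the `CounterPreAggregator` class, the `(url, metric)` key shape, and the `metrics_by_url` table with its column names are all assumptions. Increments are summed in memory and flushed as one counter update per key, so many events on the same key become a single write.

```python
from collections import Counter
from typing import List, Tuple


class CounterPreAggregator:
    """Sum increments in memory and emit one write per key per flush."""

    def __init__(self) -> None:
        # Keys are hypothetical (url, metric) pairs, e.g. ("example.com/a", "pageviews").
        self._pending: Counter = Counter()
        self.events_seen = 0

    def record(self, url: str, metric: str, amount: int = 1) -> None:
        """Buffer one increment instead of writing it straight to the database."""
        self._pending[(url, metric)] += amount
        self.events_seen += 1

    def flush(self) -> List[Tuple[str, Tuple[int, str, str]]]:
        """Return (statement, parameters) pairs and reset the buffer.

        The table and column names are made up for this sketch.
        """
        statement = (
            "UPDATE metrics_by_url SET total = total + %s "
            "WHERE url = %s AND metric = %s"
        )
        writes = [
            (statement, (amount, url, metric))
            for (url, metric), amount in sorted(self._pending.items())
        ]
        self._pending.clear()
        return writes


if __name__ == "__main__":
    agg = CounterPreAggregator()
    for _ in range(10):
        agg.record("example.com/a", "pageviews")
    writes = agg.flush()
    # Ten recorded events collapse into a single counter update.
    print(agg.events_seen, len(writes))
```

The trade-off is durability: increments that are still buffered are lost if the process dies before the next flush, so the flush interval bounds how much can go missing.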

UNDERSTAND FAILURE SCENARIOS

๏ Nodes might have removed themselves from a cluster because the disk was full

๏ Apps might lose connections to the cluster and then take 45 min to reconnect (or longer on bigger clusters)

๏ A slow node might make the entire cluster unusable

๏ A poorly gossiping node might overwork itself out of the cluster

๏ Adding a node to the cluster might take down all connected apps

๏ Sometimes you just can’t removenode (or bootstrap)
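The disk-full scenario is one that monitoring can catch early. Below is a minimal sketch of a disk headroom check using only the Python standard library. The function names and both thresholds are assumptions for illustration, and `"."` stands in for a node's data directory, which varies by install.

```python
import shutil


def used_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_alert(fraction_used: float, warn_at: float = 0.5, page_at: float = 0.8) -> str:
    """Map a usage fraction to an alert level.

    Both thresholds are illustrative assumptions; pick values that leave
    room for compaction and streaming on your own cluster.
    """
    if not 0.0 <= fraction_used <= 1.0:
        raise ValueError("fraction_used must be between 0 and 1")
    if fraction_used >= page_at:
        return "page"
    if fraction_used >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # "." is a placeholder for the data directory being watched.
    print(disk_alert(used_fraction(".")))
```

Keeping the threshold logic separate from the measurement makes it easy to test the alert levels without filling a disk.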

WHAT SHOULD YOU WALK AWAY WITH?

๏ Incredibly important to have a deep understanding of your use-cases

๏ Sometimes database tuning has nothing to do with database settings

๏ Understand failure scenarios for your use-cases

๏ Give back, it helps everything get better

๏ Ignoring best practices is almost never a good idea

THANKS FOR LISTENING

QUESTIONS IN LIFE ARE GUARANTEED,

ANSWERS AREN’T.

Eric Lubow

@elubow

NYC Cassandra Day