Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known

Hindsight is 20/20: MySQL to Cassandra

Michael Kjellman (@mkjellman)
Barracuda Networks

What I Do

• Build and maintain “real-time” Spam detection and Web Filter classification

• Java/Perl/C (and bits of everything else)
• Author of perlcassa (Perl C* client)
• Frontend? Backend? Customer? Internal?

Broken RAID Card? Bad Disk? I touch it all.

Our C* Cluster

• In production for ~2 years since 0.8
• Running 1.2.5 + minor patches
• 24 nodes in 2 datacenters
• (2) 2TB Hard Drives (no RAID)
• (1) Small SSD for small hot CFs
• 64GB of RAM
• Puppet for management
• Cobbler for deployment
• Target max load at 600GB per node

What is “real-time” exactly?

Our Rewrite by the Numbers

Metric                                        Cassandra Based    MySQL Based
Average Application Latency                   2.41ms             5.0ms
Elements in Database                          32,836,767         3,946,713
Elements Application Handles                  32,836,767         314,974
Element Seen Prior to Tracking                1st request        Various Thresholds
Datacenters                                   2                  1
Average Latency of Automated Classification   3 seconds          8 minutes
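The figures above are easier to compare as ratios. A small arithmetic sketch using only the numbers from this slide (the variable names are illustrative, not from the talk):

```python
# Ratios derived from the "Our Rewrite by the Numbers" figures.
# MySQL-based value divided by Cassandra-based value for latencies,
# Cassandra-based divided by MySQL-based for element counts.
latency_speedup = 5.0 / 2.41                  # average application latency, ms
classification_speedup = (8 * 60) / 3         # 8 minutes vs. 3 seconds
stored_growth = 32_836_767 / 3_946_713        # elements in database
handled_growth = 32_836_767 / 314_974         # elements the application handles

print(f"application latency:      {latency_speedup:.2f}x faster")
print(f"automated classification: {classification_speedup:.0f}x faster")
print(f"elements stored:          {stored_growth:.1f}x more")
print(f"elements handled:         {handled_growth:.0f}x more")
```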

Should you Rewrite?

• How To Survive a Ground-Up Rewrite Without Losing Your Sanity[1] – Joel Spolsky

• Past engineering decisions preventing implementation of new business requirements

• New threats smarter and more targeted

[1] http://onstartups.com/tabid/3339/bid/97052/How-To-Survive-a-Ground-Up-Rewrite-Without-Losing-Your-Sanity.aspx

Evolving Legacy Systems

• Even good developers can write sloppy code
• Too much duct tape
  – Most layers applied around the database

Hitting the Reset Button

• Plan for continuous failure
• Easily Scalable
• No Single Point of Failure – that you know of
• Many smaller boxes vs. one monolithic box

Whiteboard to Reality

• Get technical buy-in from all parties
• Migrate and rewrite in stages
  – Business requirements forced a hybrid period with the old and new systems operating in parallel

Cassandra is Not…

1. Direct MySQL replacement
2. Magic bullet to solve everything

Migrating

• Painful
• Painful
• Painful
• Tons of rewriting
• Tons of regressions
• Did I say painful?

So Why Migrate?

• C* is the best option for persistence tier
• Business success motivation
• Don’t let your database hold you back

Lessons Learned (the good)

• Carefully defining data model up front
• Creating a flexible systems architecture that adapts well to changes during implementation
• Seriously – “Measure twice, cut once.”

Lessons Learned (the bad)

• Consider migration and delivery requirements from the very beginning

• Adjust expectations – we did not expect to rely on legacy systems for so long

• Make syncing data between systems a priority

1. Start with the queries
2. Think differently regarding reads
3. Syncing and migrating data
4. Don’t use C* as a queue
5. Estimate capacity
6. Automate, Automate, Automate
7. Some maintenance required

1. Start with the Queries

• C* != “#dontneedtothinkaboutmyschema”
• Counters and Composites
• Optimize for use case
  – Don’t be afraid of writes. Storage is cheap.
  – Optimize to reduce the number of tombstones
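One common reading of "start with the queries" is to keep one table per read pattern and write each fact once per table. The sketch below is a toy model under that assumption: plain Python dicts stand in for column families, and the names (`by_sender`, `by_domain`, `record_message`) are hypothetical, not from the talk.

```python
from collections import defaultdict

# Each dict stands in for one column family, keyed the way one query reads it.
by_sender = defaultdict(list)   # answers: "all messages from this sender"
by_domain = defaultdict(list)   # answers: "all messages from this domain"


def record_message(sender: str, subject: str) -> None:
    """Write the same fact once per query pattern (writes are cheap)."""
    domain = sender.split("@", 1)[1]
    row = {"sender": sender, "subject": subject}
    by_sender[sender].append(row)
    by_domain[domain].append(row)


record_message("a@example.com", "hello")
record_message("b@example.com", "offer")
record_message("a@example.org", "invoice")

# Each read is a single key lookup; no scan or join is needed.
print(len(by_domain["example.com"]))
```

The trade is more writes and more storage in exchange for reads that touch exactly one key, which matches the "don't be afraid of writes" bullet.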

2. Think Differently Regarding Reads

• Do you really need all that data at once?
• mysql> SELECT * FROM mysupercooltable WHERE foo = 'bar';
  – Slow, but eventually will work
• cqlsh> SELECT * FROM myreallybigcf WHERE foo = 'bar';
  – Won’t work. Expect RPC timeout exceptions on reads generally after ~10,000 rows even with paging
• Our solutions:
  – ElasticSearch
  – Hadoop/Pig
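When the application does need to walk a large result, one alternative to a single huge SELECT is to issue many small bounded reads, each resuming after the last key seen. This is a minimal sketch of that idea, not the approach the talk used (the talk names ElasticSearch and Hadoop/Pig). A sorted in-memory list stands in for one wide row, keys are assumed unique, and `fetch_page` and `iter_rows` are hypothetical names.

```python
from bisect import bisect_right
from typing import Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # (clustering key, value), sorted by key


def fetch_page(rows: List[Row], after_key: Optional[int], limit: int) -> List[Row]:
    """Stand-in for one bounded query: 'WHERE key > after_key LIMIT limit'."""
    if after_key is None:
        start = 0
    else:
        keys = [key for key, _ in rows]
        start = bisect_right(keys, after_key)
    return rows[start:start + limit]


def iter_rows(rows: List[Row], page_size: int = 1000) -> Iterator[Row]:
    """Yield every row while never asking for more than page_size at once."""
    last_key = None
    while True:
        page = fetch_page(rows, last_key, page_size)
        if not page:
            return
        yield from page
        last_key = page[-1][0]   # resume after the last key of this page


data = [(i, f"v{i}") for i in range(2500)]
total = sum(1 for _ in iter_rows(data, page_size=1000))
print(total)
```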

3. Syncing and Migrating Data

• Sync and migration scripts – take more seriously than production code

• Design sync to be continuous with both systems running in parallel during migration

• Prioritize the sync
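A continuous sync can be sketched as a loop that repeatedly copies everything changed since the last pass. The example below is an illustration under stated assumptions, not the scripts from the talk: an in-memory SQLite table stands in for the legacy MySQL system, a dict stands in for the new store, and the `elements` table, its `updated_at` column, and `sync_once` are all hypothetical.

```python
import sqlite3


def sync_once(legacy: sqlite3.Connection, target: dict, watermark: int) -> int:
    """Copy rows changed since `watermark` into `target`; return the new watermark."""
    cursor = legacy.execute(
        "SELECT id, value, updated_at FROM elements "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    )
    for row_id, value, updated_at in cursor:
        target[row_id] = value                  # upsert: replaying a row is harmless
        watermark = max(watermark, updated_at)
    return watermark


legacy = sqlite3.connect(":memory:")
legacy.execute(
    "CREATE TABLE elements (id INTEGER PRIMARY KEY, value TEXT, updated_at INTEGER)"
)
legacy.executemany(
    "INSERT INTO elements VALUES (?, ?, ?)", [(1, "spam", 10), (2, "ham", 20)]
)

new_store = {}
mark = sync_once(legacy, new_store, 0)          # initial pass copies both rows
legacy.execute("UPDATE elements SET value = 'phish', updated_at = 30 WHERE id = 1")
mark = sync_once(legacy, new_store, mark)       # second pass copies only the change
print(new_store, mark)
```

Because each pass is idempotent, it can run on a schedule for as long as both systems are live. This sketch does not handle deletes or rows that share a timestamp with the watermark; a real sync would need a plan for both.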

4. Don’t use C* as a Queue

• Cassandra anti-patterns: Queues and queue-like datasets[2] – Aleksey Yeschenko

• Tombstones + read performance
• Our solution:
  – Kafka (multiple publisher, multiple consumer durable queue)
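The tombstone problem can be illustrated with a toy model. This is a deliberate simplification, not Cassandra's storage engine: deleted cells are only marked, stay in place until a compaction that this model never runs, and every read of the queue head has to step over them. The class and method names are hypothetical.

```python
class QueueLikePartition:
    """Toy model of one partition used as a queue; deletes leave tombstones."""

    def __init__(self):
        self.cells = []                  # [payload, deleted] in insertion order

    def enqueue(self, payload):
        self.cells.append([payload, False])

    def dequeue(self):
        """Return (payload, cells_scanned); the scan steps over tombstones."""
        scanned = 0
        for cell in self.cells:
            scanned += 1
            if not cell[1]:
                cell[1] = True           # mark deleted; the cell itself remains
                return cell[0], scanned
        return None, scanned


queue = QueueLikePartition()
for i in range(1000):
    queue.enqueue(i)

# Cost of each dequeue, measured in cells scanned to find the live head.
costs = [queue.dequeue()[1] for _ in range(1000)]
print(costs[0], costs[-1], sum(costs))
```

In this model the cost of each read grows with the number of items already consumed, which is the pattern the referenced anti-patterns post warns about.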

[2] http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets

5. Estimate Capacity

• Don’t forget the Java heap (8GB Max)
• Plan capacity – today and future
• Stress Tool – profile node and multiply
• MySQL hardware != Cassandra hardware
• New bottlenecks thanks to C* being so awesome?
• I/O still an important concern with C*
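"Profile node and multiply" comes down to simple arithmetic once a per-node target is chosen. The sketch below uses the cluster figures quoted earlier in the deck (24 nodes, 600GB target load per node). The replication factor is not given in the talk, so the value 3 here is an assumption for illustration, as are the function names.

```python
import math


def nodes_needed(unique_data_gb: float, replication_factor: int,
                 target_load_gb: float) -> int:
    """Nodes required to hold every replica at or below the per-node target."""
    return math.ceil(unique_data_gb * replication_factor / target_load_gb)


def unique_capacity_gb(nodes: int, target_load_gb: float,
                       replication_factor: int) -> float:
    """Unique (pre-replication) data the cluster holds at the target load."""
    return nodes * target_load_gb / replication_factor


ASSUMED_RF = 3   # assumption: total replicas across datacenters, not from the talk

print(unique_capacity_gb(24, 600, ASSUMED_RF))   # capacity of the deck's cluster
print(nodes_needed(5000, ASSUMED_RF, 600))       # nodes for a hypothetical 5TB dataset
```

This covers disk only. Heap, I/O, and compaction headroom need their own estimates, ideally from stress-testing a single node.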

6. Automate, Automate, Automate

• Love your inner Ops self. Distributed systems move complexity to operations.

• Puppet or something similar (really)
• Learn CCM earlier rather than later
  – www.github.com/pcmanus/ccm

7. Some Maintenance Required

• Repairs & Cleanup ops
  – automate and run frequently

• Rolling restart meet rolling repair

• Learn jconsole
• Solution:
  – Jolokia (JMX via HTTP)
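Jolokia exposes JMX over HTTP, so a monitoring script can read an MBean attribute with a JSON POST instead of a JMX client. The sketch below only builds the request and parses a response. The agent URL is an assumption about a typical setup, and the sample response is hand-written for illustration rather than captured from a real node.

```python
import json
import urllib.request

JOLOKIA_URL = "http://localhost:8778/jolokia/"   # assumed agent address


def build_read_request(base_url: str, mbean: str, attribute: str) -> urllib.request.Request:
    """Build a Jolokia 'read' request as a JSON POST (nothing is sent here)."""
    body = json.dumps({"type": "read", "mbean": mbean, "attribute": attribute})
    return urllib.request.Request(
        base_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def heap_used_bytes(response_text: str) -> int:
    """Extract 'used' from a HeapMemoryUsage read response."""
    payload = json.loads(response_text)
    if payload.get("status") != 200:
        raise RuntimeError(f"Jolokia returned status {payload.get('status')}")
    return payload["value"]["used"]


request = build_read_request(JOLOKIA_URL, "java.lang:type=Memory", "HeapMemoryUsage")

# Hand-written sample in the shape of a Jolokia read response (illustrative values).
sample_response = json.dumps({
    "status": 200,
    "value": {"init": 1000, "committed": 2000, "max": 8000, "used": 1500},
})
print(heap_used_bytes(sample_response))
```

Against a node running a Jolokia agent, the request could be sent with `urllib.request.urlopen(request)` and the body passed to `heap_used_bytes`.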

Where is Barracuda Today?

• 2 years in production with Cassandra
• Definitely the right choice for our persistence tier
• 2 product lines on C* based system and another major product in beta
• Achieved “real-time” response

2.0 and Beyond

• Thrift -> CQL
• CQL helps the MySQL to C* migration
  – Easier to comprehend / grasp
• Everyone understands SELECT * FROM cf WHERE key = 'foo';

• CAS and other 2.0 features make C* an even better replacement option for MySQL

C* Community

• Supercalifragilisticexpialidocious community!
• Riak, HBase, Oracle are other options. How is their dev community?
• Great client support. Great people. Great motivated developers.
• IRC: #cassandra on freenode
• Mailing List: user@cassandra.apache.org
