Building Antifragile Applications with Apache Cassandra
DESCRIPTION
Even with the best infrastructure, failures are almost guaranteed and will occur without warning. Building applications that can withstand this fact of life is both art and science. In this talk, I'll try to eliminate the art portion and focus on the science. Starting with high-level architecture decisions, I will take you through each layer and finally down to actual application code. Using Cassandra as the back-end database, we can build layers of fault tolerance that leave end users completely unaware of the underlying chaos that may be occurring. With a little planning, we can say goodbye to the Fail Whale and the fragility of the traditional RDBMS. Topics will include:
- Application strategies to utilize active-active, diverse data centers
- Replicating data with the highest integrity and maximum resilience
- Utilizing Cassandra's built-in fault tolerance
- Architecture of private, cloud, or hybrid based applications
- Application driver techniques when using Cassandra
TRANSCRIPT
@PatrickMcFadin
Patrick McFadin, Solutions Architect, DataStax
Building Antifragile Applications with Apache Cassandra
Wednesday, August 21, 13
Who I am
• Patrick McFadin
• Solution Architect at DataStax
• Cassandra MVP
• User for years
• Follow me for more:
I talk about Cassandra and building scalable, resilient apps ALL THE TIME!
@PatrickMcFadin
Background - Why are we doing this?
• We live in an always-on society
• Data-driven applications rule the day (for now)
• Failure is a fact of life
• Mike Christian, Yahoo! Director of Engineering, Infrastructure Resilience
• "Frying Squirrels and unspun gyros"
Background - Antifragile as a practice
• Antifragile: Things That Gain From Disorder
• Nassim Nicholas Taleb
• Things that get better with a little chaos
• Jesse Robbins• Master of Disaster at Amazon
• Bringer of “Game Day”
Background - Distributed in a global economy
• Closer to your users == happy users
• Latency is just physics
• Best case for light in fibre from US East to US West? 20ms
From            To               Latency (ms)
New York        London           75.07
New York        Rio de Janeiro   110.28
San Francisco   Tokyo            97.33
Singapore       Los Angeles      183.19
Tokyo           London           242.88
Challenges - When it was just HTTP
• Roaring '90s: just static HTTP
• Some of it was data driven, but most was not
• One awesome web server
Challenges - More than one web server?
• A Pentium 2 web server wasn't going to do it
• More than one web server? Wow
• Distribute the HTML and then...?
Challenges - Spreading the load
• Clients have to find your now-distributed content
• Round-robin DNS to the rescue!
• Crazy router hacks
• Hardware-based load balancers
• Resilient to failure
Challenges - Content Distribution Networks
• Cat pictures suck bandwidth
• Cat videos suck even MORE bandwidth
• User experience depends on response time
• Content closer to users? Sweet
Source: http://www.paulund.co.uk/content-delivery-network-review
Challenges - What about data?
• Databases have been designed as singletons (ACID)
• Master-slave replication dominates
• Sharding - no more joins
• Distributed replication is hard, if not impossible
Panic!!
Challenges - Facebook and MySQL
• Heavily sharded, but in one DC
• Needed a second data center
• In 2007, opted for slave DBs
• Re-wrote the query parser and "hijacked the MySQL replication stream"
• Transparent to the application
Ref: https://www.facebook.com/note.php?note_id=23844338919&id=9445547199&index=0
Challenges - Finally some science!
• 2007 - Amazon and the Dynamo paper
• Attempted to answer:
  • How do we distribute our data?
  • How do we maximize uptime?
  • How can we keep our data safe?
• Consolidated distributed database science
• 24 research papers cited
• Almost 30 years of distributed computing thought
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
So now what?
Let’s put it all together
• More than one server - uptime and scaling
• Get closer to users - reduce latency
• Transparent - make it a natural part of the system
Techniques - Step one
• Control your traffic
Techniques - Hardware traffic control
• F5 GTM - Global Traffic Manager
• A10 Thunder
• Citrix NetScaler
• Cisco ACE
• Barracuda ADC
Source: http://www.f5.com
Techniques - Service based control
• Dyn - Active failover service
• Akamai - Global Traffic Management
• Amazon - Route 53
Techniques - DIY control
• Client-level awareness
• Client chooses which path to take
A good example:
http://imaginethefutur.blogspot.com/2012/09/javascript-load-balancer-using-cookies.html
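The DIY approach boils down to a small client-side picker: round-robin over a pool of endpoints, skipping any currently marked down. A minimal sketch; the class name and host strings are illustrative, not from the talk.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class EndpointPicker {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    public EndpointPicker(List<String> hosts) {
        this.hosts = hosts;
    }

    // Round-robin over the pool, skipping hosts the client believes are down.
    public String pick(Set<String> down) {
        for (int i = 0; i < hosts.size(); i++) {
            String host = hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
            if (!down.contains(host)) {
                return host;
            }
        }
        throw new IllegalStateException("no endpoints available");
    }
}
```

The same idea works at any layer: in JavaScript in the browser (as in the linked example), or in a server-side client library.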
Techniques - Step 2
• Make your app OK with being part of a team
• Fail fast - short circuits
• Transactions are expensive
• Partial or eventual data is OK
Great discussion on this topic:
http://www.planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis
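The "fail fast - short circuits" idea is usually implemented as a circuit breaker: after a run of failures, stop calling the struggling dependency for a cool-down period instead of queueing up doomed requests. A minimal sketch with arbitrary thresholds; production libraries add half-open probing and metrics.

```java
public class CircuitBreaker {
    private final int maxFailures;
    private final long cooldownMillis;
    private int failures = 0;
    private long openedAt = -1;

    public CircuitBreaker(int maxFailures, long cooldownMillis) {
        this.maxFailures = maxFailures;
        this.cooldownMillis = cooldownMillis;
    }

    // Fail fast: refuse the call while the circuit is open.
    public boolean allowRequest(long nowMillis) {
        if (failures < maxFailures) {
            return true;                         // circuit closed
        }
        if (nowMillis - openedAt >= cooldownMillis) {
            failures = 0;                        // cool-down over: try again
            return true;
        }
        return false;                            // still open: short-circuit
    }

    public void recordFailure(long nowMillis) {
        if (++failures >= maxFailures) {
            openedAt = nowMillis;                // trip the breaker
        }
    }

    public void recordSuccess() {
        failures = 0;
    }
}
```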
Techniques - Application Architecture
• Embrace the pod (or application unit)
• Each stands alone
East Coast: 10,000 customers
West Coast: 10,000 customers
Next deployable unit: 10,000 customers
Techniques - State management
• State in the browser?
• State in the data layer?
• State in both?
No!
Techniques - Step 3
• Make your persistence layer resilient
• If your app layer can fail, then why not the DB?
• Master-master - less complexity
• Of course I'm talking about Cassandra
Same data. Fully replicated.
Techniques - Step 4
• Test!! Test!! Test!!
Know these guys?
• Constantly breaking things
• Chaos Monkey - shuts down random services
• Always be failing so it's normal
• Read this: http://queue.acm.org/detail.cfm?id=2499552
Cassandra - Intro
• Based on the Amazon Dynamo and Google BigTable papers
• Shared nothing
• Data as safe as possible
• Predictable scaling
Dynamo
BigTable
Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server
Cassandra - Locally Distributed
• Client writes to any node
• Node coordinates with others
• Data replicated in parallel
• Replication factor: how many copies of your data?
• RF = 3 here
Cassandra - Geographically Distributed
• Client writes locally
• Data syncs across the WAN
• Replication factor per DC
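A per-DC replication factor is declared on the keyspace using Cassandra's NetworkTopologyStrategy. A sketch of building that CQL statement; the keyspace and datacenter names ("videodb", "us_east") are illustrative, not from the talk.

```java
import java.util.Map;

public class ReplicationCql {
    // Builds a CREATE KEYSPACE statement with one replication factor per DC.
    public static String createKeyspace(String name, Map<String, Integer> rfPerDc) {
        StringBuilder sb = new StringBuilder("CREATE KEYSPACE ").append(name)
            .append(" WITH replication = {'class': 'NetworkTopologyStrategy'");
        for (Map.Entry<String, Integer> e : rfPerDc.entrySet()) {
            sb.append(", '").append(e.getKey()).append("': ").append(e.getValue());
        }
        return sb.append("}").toString();
    }
}
```

Executing the resulting string (e.g. `session.execute(...)` with the DataStax driver) requires a live cluster, so it is omitted here.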
Cassandra - Consistency
• Consistency Level (CL)
• Client specifies per read or write
• ALL = all replicas ack
• QUORUM = a majority (> 50%) of replicas ack
• LOCAL_QUORUM = a majority of replicas in the local DC ack
• ONE = only one replica acks
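The ack counts behind these levels are simple majority math: QUORUM needs floor(RF/2) + 1 replicas. A small illustration of that arithmetic, not driver code:

```java
public class ConsistencyMath {
    // QUORUM = floor(RF / 2) + 1, i.e. a strict majority of replicas.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // How many replica acks each consistency level requires for a given RF.
    public static int acksNeeded(String level, int rf) {
        switch (level) {
            case "ALL":    return rf;
            case "QUORUM": return quorum(rf);
            case "ONE":    return 1;
            default: throw new IllegalArgumentException("unknown level: " + level);
        }
    }
}
```

So with RF = 3, QUORUM needs 2 acks; with RF = 5, it needs 3.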
Cassandra - Transparent to the application
• A single node failure shouldn't bring failure
• Replication Factor + Consistency Level = success
• This example: RF = 3, CL = QUORUM
A majority (2 of 3) acked, so we are good!
Cassandra Applications - Drivers
• DataStax drivers for Cassandra
• Java
• C#
• Python
• More on the way
Cassandra Applications - Connecting
• Create a pool of local servers
• Client just uses the session to interact with Cassandra
contactPoints = {"10.0.0.1", "10.0.0.2"}
keyspace = "videodb"

public VideoDbBasicImpl(List<String> contactPoints, String keyspace) {
    cluster = Cluster.builder()
        .addContactPoints(contactPoints.toArray(new String[contactPoints.size()]))
        .withLoadBalancingPolicy(Policies.defaultLoadBalancingPolicy())
        .withRetryPolicy(Policies.defaultRetryPolicy())
        .build();
    session = cluster.connect(keyspace);
}
Cassandra Applications - Load balancing
• Token aware - requests are sent to the primary node holding the data
• Calls can be asynchronous and in parallel
[Diagram: multiple client threads issue parallel requests through the driver to the nodes that own the data]
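The parallel calls in the diagram can be sketched with plain executor-based concurrency; with the real DataStax driver you would use Session.executeAsync and collect the returned futures. The Callable-based "queries" here are stand-ins for real driver calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelQueries {
    // Runs all "queries" concurrently and collects their results in order.
    public static List<String> runAll(List<Callable<String>> queries) {
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());
        try {
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(queries)) {
                results.add(f.get());        // blocks until each result arrives
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Issuing the requests in parallel means total latency is close to the slowest single call, not the sum of all calls.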
Cassandra Applications - Fault tolerance
• Try first with a Consistency Level of QUORUM
• If that fails, retry with Consistency Level ONE
[Diagram: client sends a request to a coordinator node, which talks to three replicas]
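A sketch of that retry pattern, with the query modeled as a function from consistency level to result; the real DataStax Java driver ships a DowngradingConsistencyRetryPolicy that does much the same automatically.

```java
import java.util.function.Function;

public class FallbackRead {
    // Try the read at QUORUM first; if not enough replicas answer,
    // degrade to ONE rather than failing the request outright.
    public static <T> T readWithFallback(Function<String, T> query) {
        try {
            return query.apply("QUORUM");
        } catch (RuntimeException notEnoughReplicas) {
            return query.apply("ONE");
        }
    }
}
```

The trade-off: the ONE retry may return slightly stale data, but the user gets an answer instead of an error.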
Application Example - Layout
• Active-active
• Service-based DNS routing
Cassandra Replication
Application Example - Uptime
• Normal server maintenance
• Application is unaware
Cassandra Replication
Application Example - Failure
• Data center failure
• Data is safe. Route traffic.
Another happy user!
Conclusion
• Cassandra is THE BEST persistence tier for your application
• Plan for chaos. Inject your own.
• Now go write your app.
Data Modeling!
1. The Data Model is Dead, Long Live the Data Model
   http://www.youtube.com/watch?v=px6U2n74q3g
2. Become a Super Modeler
   http://www.youtube.com/watch?v=qphhxujn5Es
3. The World's Next Top Data Model
   http://www.youtube.com/watch?v=HdJlsOZVGwM

Example code
https://github.com/pmcfadin/cql3-videodb-example
Thank you!
CALL FOR PAPERS
SPONSORSHIP OPPORTUNITY
TWO DAYS, 30+ SESSIONS
TRAINING DAY
Oh yeah. We are hiring.
Q&A?