Building Antifragile Applications with Apache Cassandra
DESCRIPTION
Even with the best infrastructure, failures are almost guaranteed and will occur without warning. Building applications that can withstand this fact of life is both art and science. In this talk, I'll try to eliminate the art portion and focus on the science. Starting with high-level architecture decisions, I will take you through each layer and finally down to actual application code. Using Cassandra as the back-end database, we can build layers of fault tolerance that leave end users completely unaware of the underlying chaos that may be occurring. With a little planning, we can say goodbye to the Fail Whale and the fragility of the traditional RDBMS. Topics will include:
- Application strategies to utilize active-active, diverse data centers
- Replicating data with the highest integrity and maximum resilience
- Utilizing Cassandra's built-in fault tolerance
- Architecture of private, cloud, or hybrid based applications
- Application driver techniques when using Cassandra
TRANSCRIPT
@PatrickMcFadin
Patrick McFadin, Solutions Architect, DataStax
Building Antifragile Applications with Apache Cassandra
Wednesday, August 21, 13
Who I am
• Patrick McFadin
• Solution Architect at DataStax
• Cassandra MVP
• User for years
• Follow me for more:
I talk about Cassandra and building scalable, resilient apps ALL THE TIME!
@PatrickMcFadin
Background - Why are we doing this?
• We live in an always-on society
• Data-driven applications rule the day (for now)
• Failure is a fact of life
• Mike Christian, Yahoo! Director of Engineering, Infrastructure Resilience
• "Frying Squirrels and unspun gyros"
Background - Antifragile as a practice
• Antifragile: Things That Gain From Disorder
• Nassim Nicholas Taleb
• Things that get better with a little chaos
• Jesse Robbins• Master of Disaster at Amazon
• Bringer of “Game Day”
Background - Distributed in a global economy
• Closer to your users == happy users
• Latency is just physics
• Best case for light in fibre from US East to US West? 20ms
From            To               Latency (ms)
New York        London           75.07
New York        Rio de Janeiro   110.28
San Francisco   Tokyo            97.33
Singapore       Los Angeles      183.19
Tokyo           London           242.88
Challenges - When it was just HTTP
• Roaring '90s: just static HTTP
• Some of it was data driven, but most was not
• One awesome web server
Challenges - More than one web server?
• A Pentium 2 web server wasn't going to do it
• More than one web server? Wow
• Distribute the HTML and then...?
Challenges - Spreading the load
• Clients have to find your now-distributed content
• Round-robin DNS to the rescue!
• Crazy router hacks
• Hardware-based load balancers
• Resilient to failure
Challenges - Content Distribution Networks
• Cat pictures suck bandwidth
• Cat videos suck even MORE bandwidth
• User experience depends on response time
• Content closer to users? Sweet
Source: http://www.paulund.co.uk/content-delivery-network-review
Challenges - What about data?
• Databases have been designed as singletons (ACID)
• Master-slave replication dominates
• Sharding - no more joins
• Distributed replication is hard, if not impossible
Panic!!
Challenges - Facebook and MySQL
• Heavily sharded, but in one DC
• Needed a second data center
• In 2007, opted for slave DBs
• Re-wrote the query parser and "hijacked the MySQL replication stream"
• Transparent to the application
Ref: https://www.facebook.com/note.php?note_id=23844338919&id=9445547199&index=0
Challenges - Finally some science!
• 2007 - Amazon and the Dynamo paper
• Attempted to answer:
  • How do we distribute our data?
  • How do we maximize uptime?
  • How can we keep our data safe?
• Consolidated distributed database science
• 24 research papers cited
• Almost 30 years of distributed computing thought
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
So now what?
Let’s put it all together
• More than one server - uptime and scaling
• Get closer to users - reduce latency
• Transparent - make it a natural part of the system
Techniques - Step one
• Control your traffic
Techniques - Hardware traffic control
• F5 GTM - Global Traffic Manager
• A10 Thunder
• Citrix NetScaler
• Cisco ACE
• Barracuda ADC
Source: http://www.f5.com
Techniques - Service based control
• Dyn - Active failover service
• Akamai - Global Traffic Management
• Amazon - Route 53
Techniques - DIY control
• Client-level awareness
• Client chooses which path to take
A good example:
http://imaginethefutur.blogspot.com/2012/09/javascript-load-balancer-using-cookies.html
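The DIY approach boils down to a small client-side picker: round-robin over a pool of endpoints, skipping any currently marked down. A minimal sketch; the class name and host strings are illustrative, not from the talk.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

public class EndpointPicker {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    public EndpointPicker(List<String> hosts) {
        this.hosts = hosts;
    }

    // Round-robin over the pool, skipping hosts the client believes are down.
    public String pick(Set<String> down) {
        for (int i = 0; i < hosts.size(); i++) {
            String host = hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
            if (!down.contains(host)) {
                return host;
            }
        }
        throw new IllegalStateException("no endpoints available");
    }
}
```

The same idea works at any layer: in JavaScript in the browser (as in the linked example), or in a server-side client library.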
Techniques - Step 2
• Make your app OK with being part of a team
• Fail fast - short circuits
• Transactions are expensive
• Partial or eventual data is OK
Great discussion on this topic:
http://www.planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis
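The "fail fast - short circuits" idea is usually implemented as a circuit breaker: after a run of failures, stop calling the struggling dependency for a cool-down period instead of queueing up doomed requests. A minimal sketch with arbitrary thresholds; production libraries add half-open probing and metrics.

```java
public class CircuitBreaker {
    private final int maxFailures;
    private final long cooldownMillis;
    private int failures = 0;
    private long openedAt = -1;

    public CircuitBreaker(int maxFailures, long cooldownMillis) {
        this.maxFailures = maxFailures;
        this.cooldownMillis = cooldownMillis;
    }

    // Fail fast: refuse the call while the circuit is open.
    public boolean allowRequest(long nowMillis) {
        if (failures < maxFailures) {
            return true;                         // circuit closed
        }
        if (nowMillis - openedAt >= cooldownMillis) {
            failures = 0;                        // cool-down over: try again
            return true;
        }
        return false;                            // still open: short-circuit
    }

    public void recordFailure(long nowMillis) {
        if (++failures >= maxFailures) {
            openedAt = nowMillis;                // trip the breaker
        }
    }

    public void recordSuccess() {
        failures = 0;
    }
}
```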
Techniques - Application Architecture
• Embrace the pod (or application unit)
• Each stands alone
East Coast: 10,000 customers
West Coast: 10,000 customers
Next deployable unit: 10,000 customers
Techniques - State management
• State in the browser?
• State in the data layer?
• State in both?
No!
Techniques - Step 3
• Make your persistence layer resilient
• If your app layer can fail, then why not the DB?
• Master-master - less complexity
• Of course I'm talking about Cassandra
Same data. Fully replicated.
Techniques - Step 4
• Test!! Test!! Test!!
Know these guys?
• Constantly breaking things
• Chaos Monkey - shuts down random services
• Always be failing so it's normal
• Read this: http://queue.acm.org/detail.cfm?id=2499552
Cassandra - Intro
• Based on the Amazon Dynamo and Google BigTable papers
• Shared nothing
• Data as safe as possible
• Predictable scaling
Dynamo
BigTable
Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server
Cassandra - Locally Distributed
• Client writes to any node
• Node coordinates with others
• Data replicated in parallel
• Replication factor: how many copies of your data?
• RF = 3 here
Cassandra - Geographically Distributed
• Client writes locally
• Data syncs across the WAN
• Replication factor per DC
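A per-DC replication factor is declared on the keyspace using Cassandra's NetworkTopologyStrategy. A sketch of building that CQL statement; the keyspace and datacenter names ("videodb", "us_east") are illustrative, not from the talk.

```java
import java.util.Map;

public class ReplicationCql {
    // Builds a CREATE KEYSPACE statement with one replication factor per DC.
    public static String createKeyspace(String name, Map<String, Integer> rfPerDc) {
        StringBuilder sb = new StringBuilder("CREATE KEYSPACE ").append(name)
            .append(" WITH replication = {'class': 'NetworkTopologyStrategy'");
        for (Map.Entry<String, Integer> e : rfPerDc.entrySet()) {
            sb.append(", '").append(e.getKey()).append("': ").append(e.getValue());
        }
        return sb.append("}").toString();
    }
}
```

Executing the resulting string (e.g. `session.execute(...)` with the DataStax driver) requires a live cluster, so it is omitted here.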
Cassandra - Consistency
• Consistency Level (CL)
• Client specifies per read or write
• ALL = all replicas ack
• QUORUM = a majority (> 50%) of replicas ack
• LOCAL_QUORUM = a majority of replicas in the local DC ack
• ONE = only one replica acks
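The ack counts behind these levels are simple majority math: QUORUM needs floor(RF/2) + 1 replicas. A small illustration of that arithmetic, not driver code:

```java
public class ConsistencyMath {
    // QUORUM = floor(RF / 2) + 1, i.e. a strict majority of replicas.
    public static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // How many replica acks each consistency level requires for a given RF.
    public static int acksNeeded(String level, int rf) {
        switch (level) {
            case "ALL":    return rf;
            case "QUORUM": return quorum(rf);
            case "ONE":    return 1;
            default: throw new IllegalArgumentException("unknown level: " + level);
        }
    }
}
```

So with RF = 3, QUORUM needs 2 acks; with RF = 5, it needs 3.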
Cassandra - Transparent to the application
• A single node failure shouldn't bring failure
• Replication Factor + Consistency Level = success
• This example: RF = 3, CL = QUORUM
A majority (2 of 3) acked, so we are good!
Cassandra Applications - Drivers
• DataStax drivers for Cassandra
• Java
• C#
• Python
• More on the way
Cassandra Applications - Connecting
• Create a pool of local servers
• Client just uses the session to interact with Cassandra
contactPoints = {"10.0.0.1", "10.0.0.2"}
keyspace = "videodb"

public VideoDbBasicImpl(List<String> contactPoints, String keyspace) {
    cluster = Cluster.builder()
        .addContactPoints(contactPoints.toArray(new String[contactPoints.size()]))
        .withLoadBalancingPolicy(Policies.defaultLoadBalancingPolicy())
        .withRetryPolicy(Policies.defaultRetryPolicy())
        .build();
    session = cluster.connect(keyspace);
}
Cassandra Applications - Load balancing
• Token aware - requests are sent to the primary node holding the data
• Calls can be asynchronous and in parallel
[Diagram: multiple client threads issue parallel requests through the driver to the nodes that own the data]
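The parallel calls in the diagram can be sketched with plain executor-based concurrency; with the real DataStax driver you would use Session.executeAsync and collect the returned futures. The Callable-based "queries" here are stand-ins for real driver calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelQueries {
    // Runs all "queries" concurrently and collects their results in order.
    public static List<String> runAll(List<Callable<String>> queries) {
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());
        try {
            List<String> results = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(queries)) {
                results.add(f.get());        // blocks until each result arrives
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Issuing the requests in parallel means total latency is close to the slowest single call, not the sum of all calls.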
Cassandra Applications - Fault tolerance
• Try first with a Consistency Level of QUORUM
• If that fails, retry with Consistency Level ONE
[Diagram: client sends a request to a coordinator node, which talks to three replicas]
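A sketch of that retry pattern, with the query modeled as a function from consistency level to result; the real DataStax Java driver ships a DowngradingConsistencyRetryPolicy that does much the same automatically.

```java
import java.util.function.Function;

public class FallbackRead {
    // Try the read at QUORUM first; if not enough replicas answer,
    // degrade to ONE rather than failing the request outright.
    public static <T> T readWithFallback(Function<String, T> query) {
        try {
            return query.apply("QUORUM");
        } catch (RuntimeException notEnoughReplicas) {
            return query.apply("ONE");
        }
    }
}
```

The trade-off: the ONE retry may return slightly stale data, but the user gets an answer instead of an error.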
Application Example - Layout
• Active-active
• Service-based DNS routing
Cassandra Replication
Application Example - Uptime
• Normal server maintenance
• Application is unaware
Cassandra Replication
Application Example - Failure
• Data center failure
• Data is safe. Route traffic.
Another happy user!
Conclusion
• Cassandra is THE BEST persistence tier for your application
• Plan for chaos. Inject your own.
• Now go write your app.
Data Modeling!
1. The Data Model is Dead, Long Live the Data Model
   http://www.youtube.com/watch?v=px6U2n74q3g
2. Become a Super Modeler
   http://www.youtube.com/watch?v=qphhxujn5Es
3. The World's Next Top Data Model
   http://www.youtube.com/watch?v=HdJlsOZVGwM

Example code
https://github.com/pmcfadin/cql3-videodb-example
Thank you!
CALL FOR PAPERS
SPONSORSHIP OPPORTUNITY
TWO DAYS, 30+ SESSIONS
TRAINING DAY
Oh yeah. We are hiring.
Q&A?