Thinking Like An Architect: Understanding Failure Domains By: Eric Wright

Upload: vmturbo

Post on 17-Aug-2015


TRANSCRIPT

Page 1: Understanding Failure Domains

Thinking Like An Architect: Understanding Failure Domains

By: Eric Wright

Page 2: Understanding Failure Domains

Let’s Talk Architecture

Image by telmo32

Page 3: Understanding Failure Domains

Failure Domains

There are tons of considerations when building any type of architecture.

Whether you’re building a software application or the underlying infrastructure, failure domains are critical.

Quick recap: what are failure domains?

They’re regions or components of the infrastructure with the potential for failure.

The regions can be physical or logical, and each region has unique risks and challenges.


Page 4: Understanding Failure Domains

Let’s Keep It Simple

Scenario: you’re running a web application with a single Apache web server and a single MySQL database, each on its own server (two servers in total).

Here are your risks:

Web server: you’re running a single instance, so if it fails, the application is unreachable.

Database server: a single instance risks loss of service whenever the application cannot connect to the database.

Network: we were smart enough to separate the web and database servers. But… that introduces another point of failure: the network between them.
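To make these risks concrete, here is a rough sketch of how availability compounds when every component in the chain is a single point of failure. The availability figures are hypothetical and chosen only for illustration, and the calculation assumes the components fail independently.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components.values())


# Hypothetical availability figures, not measurements.
single_instance_stack = {
    "web_server": 0.99,
    "network_link": 0.99,
    "database_server": 0.99,
}

overall = serial_availability(single_instance_stack)
print(f"Overall availability: {overall:.4f}")
```

Under these assumptions, three components that are each up 99% of the time give a stack that is up only about 97% of the time.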

Simple to see, but what should we do?


Photo by JD Hancock

Page 5: Understanding Failure Domains

Don’t Hesitate, Mitigate

Mitigation is the reduction of risk by some action or design.

Let’s walk through the top strategies for web servers, database servers, and networks.


Page 6: Understanding Failure Domains

Web Server Mitigation

Adding more web servers to handle the requests provides:

• Redundancy
• Resiliency

Add a load balancer into the application infrastructure.

The load balancer accepts inbound connections and distributes the requests across the server farm.
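As a minimal sketch of the idea, the toy balancer below hands out servers in rotation and skips any that have been marked unhealthy. The class name, method names, and server names are all hypothetical. A real deployment would use a dedicated load balancer rather than code like this.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: hands out healthy servers in rotation."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        """Record that a server failed its health check."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Return a known server to the pool."""
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call, skipping unhealthy ones.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy web servers available")


balancer = RoundRobinBalancer(["web1", "web2", "web3"])
balancer.mark_down("web2")  # simulate one failed web server
picks = [balancer.next_server() for _ in range(4)]
print(picks)  # ['web1', 'web3', 'web1', 'web3']
```

Requests keep flowing to the remaining servers while `web2` is down, which is the redundancy the extra web servers were meant to buy.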


Page 7: Understanding Failure Domains

Database Server Mitigation


To allow for failures of certain nodes:

You need a horizontally scalable database architecture.

It also ensures data availability during localized outages.

Luckily, a MySQL-style database can be deployed this way: MariaDB, a MySQL-compatible fork, allows for multi-node installations (for example, with Galera Cluster).
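From the application side, tolerating the loss of one node can be sketched as trying each database node in turn. Everything here is an assumption for illustration: `NodeUnavailable`, `connect_with_failover`, and `fake_connect` are hypothetical names, and `fake_connect` stands in for a real driver call. Many drivers and proxies handle this for you.

```python
class NodeUnavailable(Exception):
    """Raised by the hypothetical connect function when a node is down."""


def connect_with_failover(nodes, connect):
    """Return (node, connection) for the first node that accepts a connection."""
    failed = []
    for node in nodes:
        try:
            return node, connect(node)
        except NodeUnavailable:
            failed.append(node)  # remember the failure and try the next node
    raise NodeUnavailable(f"all database nodes failed: {failed}")


def fake_connect(node):
    """Stand-in for a real driver call; 'db1' simulates a localized outage."""
    if node == "db1":
        raise NodeUnavailable(node)
    return f"connection-to-{node}"


node, conn = connect_with_failover(["db1", "db2", "db3"], fake_connect)
print(node, conn)  # db2 connection-to-db2
```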

Page 8: Understanding Failure Domains

Network Mitigation


First, we can add multiple network cards and attach the uplink ports to multiple switches.

This lets us withstand a rack switch outage, a single port outage, or even a simple cable failure.
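One way to check a design like this is to list every network path from the server and look for components that appear on all of them. The sketch below does that for three hypothetical layouts. The component names and the helper `single_points_of_failure` are made up for illustration.

```python
def single_points_of_failure(paths):
    """Return components whose failure would break every path.

    Each path is a tuple of component names (NIC, cable, switch)
    that must all work for that path to carry traffic.
    """
    components = {c for path in paths for c in path}
    return sorted(c for c in components if all(c in path for path in paths))


# One NIC, one cable, one switch: everything is a single point of failure.
single_homed = [("nic0", "cable0", "switchA")]

# Two NICs into the same switch: the switch is still shared.
two_nics_one_switch = [
    ("nic0", "cable0", "switchA"),
    ("nic1", "cable1", "switchA"),
]

# Two NICs into two switches: no shared component remains.
dual_homed = [
    ("nic0", "cable0", "switchA"),
    ("nic1", "cable1", "switchB"),
]

print(single_points_of_failure(single_homed))         # ['cable0', 'nic0', 'switchA']
print(single_points_of_failure(two_nics_one_switch))  # ['switchA']
print(single_points_of_failure(dual_homed))           # []
```

Only the dual-homed layout survives a switch, port, or cable failure, which matches the claim above.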

At the networking layer: make sure the right failsafe designs are in place.

This guards against routing issues and switch issues, and it includes multiple uplinks to the external network provider for better resiliency.

But… what is the impact of all these solutions?

Page 9: Understanding Failure Domains

Have You Ever Heard This Joke?


I had a problem, so I decided to use regular expressions to fix it.

Now I have 2 problems.

Hilarious, right?

Page 10: Understanding Failure Domains

What Can Happen


Adding some more web servers looks easy.

But a web farm assumes you have a queuing system in place for database writes.
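A rough sketch of that queuing idea: several web servers put their writes on a shared queue, and a single writer applies them to the database one at a time. This uses Python's in-process `queue.Queue` purely for illustration. The `database` list, `web_server`, and `database_writer` are hypothetical stand-ins, and a real web farm would use a message queuing service shared across machines.

```python
import queue
import threading

write_queue = queue.Queue()
database = []  # stand-in for the real database table


def database_writer():
    """Single consumer: applies queued writes one at a time."""
    while True:
        item = write_queue.get()
        if item is None:  # sentinel: no more writes are coming
            break
        database.append(item)


def web_server(name, orders):
    """Each web server enqueues writes instead of touching the database."""
    for order in orders:
        write_queue.put((name, order))


writer = threading.Thread(target=database_writer)
writer.start()

servers = [
    threading.Thread(target=web_server, args=(f"web{i}", range(3)))
    for i in range(1, 4)
]
for t in servers:
    t.start()
for t in servers:
    t.join()

write_queue.put(None)  # stop the writer once the backlog is drained
writer.join()
print(len(database), "writes applied")  # 9 writes applied
```

The order in which different servers' writes arrive is not fixed, which is part of the added complexity the next line refers to.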

We fixed a single point of failure, but introduced complexity.

This is a key reason why we focus on DevOps concepts.

It’s also why we have the infrastructure and application teams both fully engaged in architecture decisions.

Page 11: Understanding Failure Domains

We’re Finished, Finally… Right?


You’ve added:

• New servers
• Load balancers
• Message queuing infrastructure

But what happens if there’s a regional power outage or network outage?

We’re not covered.

Page 12: Understanding Failure Domains

Don’t Fall Victim To Analysis Paralysis


There will never be one ultimate solution.

Hopefully, your team loves agile and lean processes, so you can try something, find the deficiencies, and iterate on your failure domain mitigation.

And for our latest power outage problem?

You can use servers outside your geographical region, or the cloud, or multiple clouds!
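As a back-of-the-envelope sketch of why a second region helps, the function below computes availability when the service stays up as long as at least one region is up. The 99% figures are hypothetical, and the calculation assumes regions fail independently, which a shared provider or shared dependency can undermine.

```python
def parallel_availability(region_availabilities):
    """Availability when at least one region being up is enough."""
    all_down = 1.0
    for availability in region_availabilities:
        all_down *= 1.0 - availability  # probability this region is down too
    return 1.0 - all_down


# Hypothetical availability figures, not measurements.
one_region = parallel_availability([0.99])
two_regions = parallel_availability([0.99, 0.99])
print(f"{one_region:.4f} {two_regions:.4f}")  # 0.9900 0.9999
```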

Page 13: Understanding Failure Domains

Here’s The Point


Nobody wants to get caught when the outage occurs and say, “Oh, I didn’t think of that!”

Be acutely aware of failure domains and scenarios when architecting a solution.

Page 14: Understanding Failure Domains

Author

Eric Wright
Principal Solutions Architect
VMTurbo
