Borg-like Resilience for Your Microservices

Philip Lombardi, Engineer

Upload: datawire

Post on 18-Jan-2017


TRANSCRIPT

Page 1: Borg-Like Resilience for Your Microservices

Borg-like Resilience for Your Microservices

Philip Lombardi, Engineer

Page 2: Borg-Like Resilience for Your Microservices

datawire.io

Background...

1. Philip Lombardi @ Datawire.io (twitter: @TheBigLombowski)

2. Datawire.io is building a Microservices Development Kit to enable developers to build resilient microservice applications.

3. Check us out after the talk: app.datawire.io

Page 3: Borg-Like Resilience for Your Microservices

What is a Microservice?

Page 4: Borg-Like Resilience for Your Microservices


Common Microservice Definitions

It’s a service that is...

● Small
● Self-contained
● Narrow in scope
● Bounded context
● Independent
● Loosely coupled

Sort of…

All these things describe attributes

Page 5: Borg-Like Resilience for Your Microservices

A (simpler) Microservices Definition

A Microservice is a unit of business logic.

A Microservice application is a distributed composition of business logic via services.

Page 6: Borg-Like Resilience for Your Microservices

Microservices Benefits and Tradeoffs

● Easier to reason about the individual components that make up the system.

● Easy to add new biz logic.

● More difficult to deploy than a classic monolith.

● More difficult to operate.

Page 7: Borg-Like Resilience for Your Microservices

Combine to build AWESOME!

“Death Star Topology”

Page 8: Borg-Like Resilience for Your Microservices

Awesome… until someone puts a torpedo down the vent shaft!

Page 9: Borg-Like Resilience for Your Microservices

In reality, they rarely seem to explode...

1. When was the last time you can remember Netflix being down? Or Uber? Or Yelp? Twitter?

2. The actual Death Stars were brittle and had a clear single point of failure. But these Death Star topologies are NOT brittle.

Page 10: Borg-Like Resilience for Your Microservices

These systems are very resilient...

They survive whole classes of problems...

● Hardware and network issues
● Software bugs
● Security exploits

Engineers:

● Find and fix issues without stopping the system.
● Add new features to the product the biz logic represents.
● Alter the system multiple times a day, causing the topology to shift and change constantly.

Page 11: Borg-Like Resilience for Your Microservices

These systems are a lot more like The Borg

Page 12: Borg-Like Resilience for Your Microservices

The Borg...

1. A collective hive of drones (biz logic) that are loosely controlled by The Queen (orchestration, discovery).

2. Nearly impossible to stop:

a. Routinely take on numerous adversaries (bugs, security threats).

b. Continue to make progress even when The Collective cannot communicate with all Drones because of secondary objectives (hardware failures, network outages).

c. Forced evolution by adopting best-of-breed technologies. What doesn’t kill them just makes them stronger (continuous integration and improvement).

3. The Borg assimilate new cultures and tech to strengthen their operational efficiency and resiliency.

Page 13: Borg-Like Resilience for Your Microservices

How did these companies become Borg-like?

1. New Architecture!

2. The new architecture made them extremely resilient to infrastructure failure AND software bugs.

Page 14: Borg-Like Resilience for Your Microservices

Failure Types...

● There are the kinds everyone always engineers for…

○ Network failures
○ Server failures
○ Storage failures
○ Resource Limits

● And then there are the kinds we often think we’re engineering for…

○ Integration Bugs
○ Functional Bugs

Page 15: Borg-Like Resilience for Your Microservices

Integration Bugs… Your new worst enemy.

● In a Microservices app your biggest issue will be the integration bug.

● Beyond a certain app size, it’s nearly impossible to get the whole system up and running in order to run integration tests.

● There’s also no compiler to save you from yourself and there is no type safety at service boundaries.

● Integration bugs are a way of life in a Microservices app because of how the system is decomposed into many small independent units.
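One way to compensate for the missing type safety at a service boundary is to check every response by hand before using it. The following is a minimal sketch, assuming a hypothetical user service that returns a JSON object with `id` and `name` fields. The names `User`, `IntegrationError`, and `parse_user`, and the field names themselves, are illustrative and not from the talk.

```python
import json
from dataclasses import dataclass


class IntegrationError(Exception):
    """Raised when a remote service returns a payload we cannot use."""


@dataclass(frozen=True)
class User:
    id: int
    name: str


def parse_user(raw: str) -> User:
    """Check the shape of a (hypothetical) user-service response by hand."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise IntegrationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise IntegrationError("expected a JSON object")
    user_id = data.get("id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(user_id, int) or isinstance(user_id, bool):
        raise IntegrationError("field 'id' must be an integer")
    if not isinstance(data.get("name"), str):
        raise IntegrationError("field 'name' must be a string")
    return User(id=user_id, name=data["name"])
```

A caller that catches `IntegrationError` has a clear signal that the remote node, and not the network, is at fault.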

Page 16: Borg-Like Resilience for Your Microservices

The new architecture

● Born from a need to be resilient to integration bugs

● Allowed adopters to move quickly, because they found a way to be resilient to both infrastructure-level issues AND software integration bugs.


Page 17: Borg-Like Resilience for Your Microservices

Traditional Architecture

[Diagram: Client, DNS, Load Balancer, Server; arrows labeled “resolve” and “traffic”]

Routing table is (relatively) static

Routing policy is global

Page 18: Borg-Like Resilience for Your Microservices

Traditional Architecture

[Diagram: Client, DNS, Load Balancer, Server; arrows labeled “resolve” and “traffic”]

Load Balancers are designed to protect against infrastructure failures first and foremost.

Page 19: Borg-Like Resilience for Your Microservices

Traditional Architecture

[Diagram: Client, DNS, Load Balancer, Server; arrows labeled “resolve” and “traffic”]

Infrastructure is NOT the biggest cause of bugs and system failure in 2016...

Page 20: Borg-Like Resilience for Your Microservices

Traditional Architecture

[Diagram: Client, DNS, Load Balancer, Server; arrows labeled “resolve” and “traffic”; server instances labeled 1.1, 1.0, 1.0, 1.0]

Biggest issue is integration bugs...

Page 21: Borg-Like Resilience for Your Microservices

Traditional Architecture

[Diagram: Client, DNS, Load Balancer, Server; arrows labeled “resolve” and “traffic”; server instances labeled 1.1, 1.0, 1.0, 1.0]

New service is returning faulty JSON / XML / CSV to the Client.

Page 22: Borg-Like Resilience for Your Microservices

Traditional Architecture

[Diagram: Client, DNS, Load Balancer, Server; arrows labeled “resolve” and “traffic”; server instances labeled 1.1, 1.0, 1.0, 1.0]

LB thinks everything is OK. Client is exploding.
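The failure mode on these slides can be reproduced in a few lines. This is a minimal sketch, assuming a round-robin load balancer whose health check only asks whether a server is up. The `Server` class, the payloads, and the four-node pool (one 1.1 node, three 1.0 nodes) are illustrative assumptions.

```python
import json
from itertools import cycle


class Server:
    def __init__(self, version: str, payload: str):
        self.version = version
        self.payload = payload
        self.up = True

    def handle(self) -> str:
        return self.payload


def lb_health_check(server: Server) -> bool:
    # Assumption: the load balancer only checks liveness, not response content.
    return server.up


def run_clients(servers, requests: int) -> int:
    """Send requests round-robin through the 'load balancer'; count client failures."""
    healthy = [s for s in servers if lb_health_check(s)]
    pool = cycle(healthy)
    failures = 0
    for _ in range(requests):
        server = next(pool)
        try:
            json.loads(server.handle())
        except json.JSONDecodeError:
            failures += 1  # the client "explodes" on the faulty payload
    return failures


servers = [
    Server("1.1", '{"status": "ok"'),   # new version returns truncated JSON
    Server("1.0", '{"status": "ok"}'),
    Server("1.0", '{"status": "ok"}'),
    Server("1.0", '{"status": "ok"}'),
]
```

Every server passes `lb_health_check`, yet `run_clients(servers, 100)` reports 25 failed requests, because the 1.1 node keeps receiving its full share of traffic.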

Page 23: Borg-Like Resilience for Your Microservices


But some smart folks figured out a better way...

● The architecture involves using “Smart Endpoints”.

● Each node is “smart” because:

○ Each node knows how to communicate with every other node without a load balancer. The intelligence of a central load balancer exists on each node.

○ When an integration bug happens, the client node can blacklist the misbehaving service node and still use the remaining nodes.

● The end result is a mesh of intelligent intercommunicating service nodes (like the Borg Collective).
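The two "smart" properties above can be sketched as a client that holds its own node list and blacklist. This is an illustration under stated assumptions, not the MDK's API: nodes are modeled as plain callables that return a JSON string, an unparseable reply stands in for an integration bug, and `SmartClient`, `good_node`, and `buggy_node` are hypothetical names.

```python
import json


class SmartClient:
    """Client-side routing: keeps its own node list and its own blacklist."""

    def __init__(self, nodes):
        self.nodes = list(nodes)   # local routing table (callables in this sketch)
        self.blacklist = set()     # indexes of nodes this client no longer trusts
        self._next = 0

    def _pick(self) -> int:
        """Round-robin over the nodes, skipping blacklisted ones."""
        for _ in range(len(self.nodes)):
            index = self._next
            self._next = (self._next + 1) % len(self.nodes)
            if index not in self.blacklist:
                return index
        raise RuntimeError("no usable nodes left")

    def call(self):
        """Try nodes in turn; blacklist any node whose reply cannot be parsed."""
        while True:
            index = self._pick()
            try:
                return json.loads(self.nodes[index]())
            except json.JSONDecodeError:
                self.blacklist.add(index)  # integration bug: stop using this node


def good_node() -> str:
    return '{"status": "ok"}'


def buggy_node() -> str:
    return '{"status": "ok"'  # truncated JSON, standing in for an integration bug
```

With `SmartClient([buggy_node, good_node, good_node])`, the first call hits the buggy node once, blacklists it, and retries on a good node; later calls never touch the buggy node again.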


Page 24: Borg-Like Resilience for Your Microservices

Smart Endpoints Architecture

[Diagram: Client, Discovery, Server; labels “heartbeats”, “routes”, and “Smart Endpoint”]

Page 25: Borg-Like Resilience for Your Microservices

Smart Endpoints Architecture

[Diagram: Client, Discovery, Server; labels “heartbeats”, “routes”, and “Smart Endpoint”]

Servers send their addresses and periodic heartbeats to Discovery, which protects you against infrastructure issues.
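The registration and heartbeat step could look roughly like this. It is a minimal in-memory sketch, assuming that a server counts as alive only if its last heartbeat arrived within a fixed time-to-live. The `Discovery` class, the 30-second default, and the method names are assumptions for illustration.

```python
import time


class Discovery:
    """Tracks which server addresses are alive, based on recent heartbeats."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self.ttl = ttl_seconds   # assumed expiry window, not from the talk
        self.clock = clock       # injectable so the sketch can be tested
        self._last_seen = {}     # (service, address) -> time of last heartbeat

    def heartbeat(self, service: str, address: str) -> None:
        """Called by a server to register itself or to renew its registration."""
        self._last_seen[(service, address)] = self.clock()

    def routes(self, service: str) -> list:
        """Return addresses whose last heartbeat is within the TTL."""
        now = self.clock()
        return sorted(
            addr
            for (svc, addr), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl
        )
```

A server that crashes or loses its network simply stops sending heartbeats, so its address drops out of `routes()` once the TTL passes.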

Page 26: Borg-Like Resilience for Your Microservices

Smart Endpoints Architecture

[Diagram: Client, Discovery, Server; labels “heartbeats”, “routes”, and “Smart Endpoint”]

Discovery pushes addresses to clients, which keep the server addresses in a local hash table. Discovery is not a SPOF because it’s just a broker. Each client owns its own independent routing table.
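To illustrate why a push model keeps Discovery off the request path, here is a small sketch in which each client stores pushed routes in a plain dict. `DiscoveryBroker`, `RoutingClient`, and the callback-style subscription are assumptions for illustration, not the real protocol.

```python
class DiscoveryBroker:
    """Broker: forwards route updates to every subscribed client."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback) -> None:
        self._subscribers.append(callback)

    def publish(self, service: str, addresses: list) -> None:
        for callback in self._subscribers:
            # Each client gets its own copy, so the tables stay independent.
            callback(service, list(addresses))


class RoutingClient:
    """Owns an independent routing table stored in a dict (hash table)."""

    def __init__(self, broker: DiscoveryBroker):
        self.routes = {}
        broker.subscribe(self.on_routes)

    def on_routes(self, service: str, addresses: list) -> None:
        self.routes[service] = addresses

    def resolve(self, service: str) -> list:
        # Served from local state; no call to Discovery on the request path.
        return self.routes.get(service, [])
```

If the broker becomes unreachable after a push, `resolve()` keeps answering from the last known table, which is the sense in which Discovery is not a single point of failure here.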

Page 27: Borg-Like Resilience for Your Microservices

Smart Endpoints Architecture

[Diagram: Client, Discovery, Server; labels “heartbeats”, “routes”, and “Smart Endpoint”; server instances labeled 1.1, 1.0, 1.0, 1.0]

In the Smart Endpoint model, when a Client talks to our buggy service and fails due to a software bug, it blacklists the node!

Page 28: Borg-Like Resilience for Your Microservices

Smart Endpoints Architecture

[Diagram: Client, Discovery, Server; labels “heartbeats”, “routes”, and “Smart Endpoint”; server instances labeled 1.1, 1.0, 1.0, 1.0]

Failure is detected QUICKLY. A tiny amount of traffic will still fail, but compared to the LB model it’s a tiny, tiny amount.
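A rough back-of-the-envelope comparison makes "tiny" concrete. The numbers are illustrative assumptions, not measurements: exactly one buggy node, perfectly even round-robin routing, failures only on the buggy node, and clients that blacklist after a fixed number of failures.

```python
def failures_behind_lb(requests_per_client: int, clients: int, nodes: int) -> int:
    """Round-robin LB that never removes the buggy node: it keeps its 1/nodes share."""
    total = requests_per_client * clients
    return total // nodes


def max_failures_with_smart_endpoints(clients: int, threshold: int = 1) -> int:
    """Upper bound: each client fails at most `threshold` times before blacklisting."""
    return clients * threshold


lb_failures = failures_behind_lb(requests_per_client=100, clients=10, nodes=4)
smart_failures = max_failures_with_smart_endpoints(clients=10)
```

Under these assumptions, 10 clients sending 100 requests each across 4 nodes lose 250 requests behind the load balancer, versus at most 10 with per-client blacklisting. The first number grows with traffic; the second does not.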

Page 29: Borg-Like Resilience for Your Microservices


It’s mostly about Circuit Breakers...

Page 30: Borg-Like Resilience for Your Microservices

It’s mostly about Circuit Breakers...

● Smart Endpoints as an architecture works because of a tech called Circuit Breakers

● Nodes independently track their usage of remote services. When they encounter a failure due to software or infrastructure, the remote service is blacklisted.

● It’s important to understand circuit breakers are local and not global. Each service in your system might have a different concept of “working”.
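For readers who have not seen one, a circuit breaker can be sketched as a small wrapper around calls to one remote node. This follows the commonly described closed / open / half-open pattern rather than any specific library. The threshold, the reset timeout, and the names `CircuitBreaker` and `CircuitOpenError` are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_timeout = reset_timeout          # assumed value, in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("remote node is blacklisted")
            # Timeout elapsed: let one trial call through ("half-open").
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Because the breaker is local, each client would hold one instance per remote node, and what counts as a failure is decided by whatever exceptions that client's own `func` raises.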

Page 31: Borg-Like Resilience for Your Microservices

Circuit Breakers are really powerful...

● Circuit breakers provide safety from both infrastructure AND software issues

● Timeouts, network partitions, and server failures are all transient failures that come and go. A node can be temporarily blacklisted when a failure is due to an infrastructure issue.

● A software bug, such as our aforementioned integration bug on the remote service, is not going to fix itself. The node can be permanently blacklisted.
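The temporary versus permanent distinction might be expressed by classifying the error that caused the failure. In this sketch, which exception types count as transient or permanent, the 60-second cooldown, and the `Blacklist` class are all illustrative assumptions.

```python
import json
import time

TRANSIENT_ERRORS = (TimeoutError, ConnectionError)   # infrastructure-style failures
PERMANENT_ERRORS = (json.JSONDecodeError, KeyError)  # stand-ins for integration bugs


class Blacklist:
    def __init__(self, cooldown=60.0, clock=time.monotonic):
        self.cooldown = cooldown   # assumed length of a temporary ban, in seconds
        self.clock = clock
        self._banned_until = {}    # node -> expiry time, or None for a permanent ban

    def record_failure(self, node: str, error: Exception) -> None:
        if isinstance(error, PERMANENT_ERRORS):
            self._banned_until[node] = None
        elif isinstance(error, TRANSIENT_ERRORS):
            # Do not downgrade an existing permanent ban to a temporary one.
            if self._banned_until.get(node, 0.0) is not None:
                self._banned_until[node] = self.clock() + self.cooldown

    def is_banned(self, node: str) -> bool:
        if node not in self._banned_until:
            return False
        expiry = self._banned_until[node]
        return expiry is None or self.clock() < expiry
```

In practice a "permanent" ban would likely last until the node re-registers with a new version; the sketch leaves that step out.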

Page 32: Borg-Like Resilience for Your Microservices

Smart Endpoints In a Nutshell

Two big but very simple things!

1. Each service maintains its own record of addresses in the environment. A service can in theory talk to any other service.

2. Circuit breakers prevent catastrophic failures by blacklisting misbehaving nodes. Blacklisting is done on a per-node basis.

Page 33: Borg-Like Resilience for Your Microservices

Smart Endpoints Advantages...

● Smart Endpoints allow us to do quick integration testing because we don’t have to worry about catastrophic cascade failure (blacklist the misbehaving node and talk to known working nodes).

● Smart Endpoints make integration bugs far less dangerous and therefore enable faster development cycles.

● Still prevent classic infrastructure issues from being the downfall of your app.

● Do you see where this is going?

Page 34: Borg-Like Resilience for Your Microservices

Back to Borg-like Architecture...

● If Smart Endpoints prevent catastrophic software failure then the following things become true:

○ Change becomes a way of life. Modifying the system with new services becomes something developers, management and operations are comfortable with.

○ Change means new technology can be integrated, features added, bugs fixed and security holes patched without fear that a million- or billion-dollar line-of-business application fails and costs the company huge $$$.

○ The end result is an application that is in a continual state of improvement.

Page 35: Borg-Like Resilience for Your Microservices

To learn more

Contact me:

● [email protected]
● Twitter: @TheBigLombowski

Jobs

● https://www.datawire.io/careers/
○ Java, Python, Go and Kubernetes Developers Wanted!

Try

● https://github.com/datawire/mdk
● https://www.datawire.io
