Borg-like Resilience for Your Microservices
Philip Lombardi, Engineer
datawire.io
Background...
1. Philip Lombardi @ Datawire.io (twitter: @TheBigLombowski)
2. Datawire.io is building a Microservices Development Kit to enable developers to build resilient microservice applications.
3. Check us out after the talk: app.datawire.io
What is a Microservice?
Common Microservice Definitions
It’s a service that is...
● Small
● Self-contained
● Narrow in scope
● Bounded context
● Independent
● Loosely coupled
Sort of…
All of these describe attributes of a microservice, not what a microservice is.
A (simpler) Microservices Definition
A Microservice is a unit of business logic.
A Microservice application is a distributed composition of business logic via services.
Microservices Benefits and Tradeoffs
● Easier to reason about the individual components that make up the system.
● Easy to add new biz logic.
● More difficult to deploy than a classic monolith.
● More difficult to operate.
Combine to build AWESOME!
“Death Star Topology”
Awesome…, until someone puts a torpedo down the vent shaft!
In reality, they rarely seem to explode...
1. When was the last time you can remember Netflix being down? Or Uber? Or Yelp? Twitter?
2. The actual Death Stars were brittle and had a clear single point of failure. But these Death Star topologies are NOT brittle.
These systems are very resilient...
They survive whole classes of problems...
● Hardware and network issues
● Software bugs
● Security exploits
Engineers:
● Find and fix issues without stopping the system.
● Add new features to the product the biz logic represents.
● Alter the system multiple times a day, causing the topology to shift and change constantly.
These systems are a lot more like The Borg
The Borg...
1. A collective hive of drones (biz logic) that are loosely controlled by The Queen (orchestration, discovery).
2. Nearly impossible to stop:
a. Routinely take on numerous adversaries (bugs, security threats).
b. Continue to make progress even when The Collective is unable to communicate with all Drones because of secondary objectives (hardware failures, network outages).
c. Forced evolution by adopting best-of-breed technologies. What doesn’t kill them just makes them stronger (continuous integration and improvement).
3. The Borg assimilate new cultures and tech to strengthen their operational efficiency and resiliency.
How did these companies become Borg-like?
1. New Architecture!
2. The new architecture made them extremely resilient to infrastructure failure AND software bugs.
Failure Types...
● There are the kind everyone always engineers for…
○ Network failures
○ Server failures
○ Storage failures
○ Resource limits
● And then there are the kinds we often think we’re engineering for…
○ Integration Bugs
○ Functional Bugs
Integration Bugs… Your new worst enemy.
● In a Microservices app your biggest issue will be the integration bug.
● At a certain app size it’s nearly impossible to get the whole system up and running in order to run integration tests.
● There’s also no compiler to save you from yourself and there is no type safety at service boundaries.
● Integration bugs are a way of life in a Microservices app because of how the system is decomposed into many small independent units.
The new architecture
● Born from a need to be resilient to integration bugs
● Allowed adopters of the new architecture to move quickly as they found a way to be resilient to both infrastructure level issues AND software integration bugs.
Traditional Architecture
[Diagram: Client, DNS, Load Balancer, Server; arrows labeled "resolve" and "traffic"]
● Routing table is (relatively) static
● Routing policy is global
Traditional Architecture
Load Balancers are designed to protect against infrastructure failures first and foremost.
Traditional Architecture
Infrastructure is NOT the biggest cause of bugs and system failure in 2016...
Traditional Architecture
[Diagram: servers labeled with versions 1.1, 1.0, 1.0, 1.0]
The biggest issue is integration bugs...
Traditional Architecture
The new service (1.1) is returning faulty JSON / XML / CSV to the Client.
Traditional Architecture
The LB thinks everything is OK. The Client is exploding.
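To make the mismatch concrete, here is a minimal sketch of why the load balancer stays happy while the client breaks. Everything in it is a hypothetical stand-in: the servers are plain Python functions rather than real network services, and `lb_health_check` assumes the load balancer only checks for a 200 response.

```python
import json

# Hypothetical stand-ins for the servers: three run 1.0, one runs 1.1.
def server_v1_0(request):
    return 200, json.dumps({"user": "picard", "rank": "captain"})

def server_v1_1(request):
    # Integration bug: the process is up and answers 200,
    # but the body is not valid JSON.
    return 200, "{'user': 'picard', 'rank': 'captain'"

def lb_health_check(server):
    """Assumed LB behavior: only checks that the server answers with 200."""
    status, _ = server("GET /health")
    return status == 200

def client_call(server):
    """The client has to actually parse and use the payload."""
    _, body = server("GET /user")
    return json.loads(body)["rank"]

servers = [server_v1_1, server_v1_0, server_v1_0, server_v1_0]

# Every server passes the infrastructure-level check...
healthy = [lb_health_check(s) for s in servers]

# ...but the client cannot use what the 1.1 server sends back.
results = []
for s in servers:
    try:
        results.append(client_call(s))
    except (ValueError, KeyError) as err:
        results.append("client error: " + type(err).__name__)

print(healthy)
print(results)
```

The health check and the client are asking two different questions, and only the client is in a position to know whether the answer was usable.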
But some smart folks figured out a better way...
● The architecture involves using “Smart Endpoints”.
● Each node is “smart” because:
○ Each node knows how to communicate with every other node without a load balancer. The intelligence of a central load balancer exists on each node.
○ When an integration bug happens, the client node can blacklist the misbehaving service node and still use the remaining nodes.
● The end result is a mesh of intelligent intercommunicating service nodes (like the Borg Collective).
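A smart endpoint along these lines could be sketched as follows. This is an illustration under stated assumptions, not the Datawire MDK API: `SmartClient`, `IntegrationError`, the addresses, and the round-robin selection are all hypothetical, and remote calls are simulated with local functions.

```python
class IntegrationError(Exception):
    """Raised when a node answers but the answer is unusable (hypothetical)."""

class SmartClient:
    """Client-side routing: a local table of nodes plus a local blacklist."""

    def __init__(self, routes):
        self.routes = dict(routes)   # address -> callable simulating a remote call
        self.blacklist = set()       # addresses this client refuses to use
        self._turn = 0               # round-robin counter

    def call(self, request):
        while True:
            candidates = [a for a in self.routes if a not in self.blacklist]
            if not candidates:
                raise RuntimeError("no usable nodes left")
            address = candidates[self._turn % len(candidates)]
            self._turn += 1
            try:
                return self.routes[address](request)
            except IntegrationError:
                # Blacklist the misbehaving node and try one of the others.
                self.blacklist.add(address)

def good_node(request):
    return {"ok": True}

def buggy_node(request):
    raise IntegrationError("faulty payload")

client = SmartClient({
    "10.0.0.1": buggy_node,
    "10.0.0.2": good_node,
    "10.0.0.3": good_node,
})
responses = [client.call("ping") for _ in range(10)]
print(client.blacklist)
print(responses[0])
```

The blacklist lives on the client, so another client with different expectations of the same node is free to keep using it.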
Smart Endpoints Architecture
[Diagram: Client, Discovery, Server; arrows labeled "heartbeats" and "routes"; label "Smart Endpoint"]
Smart Endpoints Architecture
Servers send their addresses and periodic heartbeats to Discovery, which protects you against infrastructure issues.
Smart Endpoints Architecture
Discovery pushes addresses to clients, which keep the server addresses in a local hash table. Discovery is not a SPOF because it’s just a broker. Each Client owns its own independent routing table.
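The heartbeat and push flow can be sketched roughly like this. It is an in-memory illustration only: the `Discovery` and `Client` classes, the `ttl_seconds` parameter, the service name, and the addresses are assumptions made for the example, and a fake clock stands in for real time.

```python
import time

class Client:
    def __init__(self):
        self.routes = {}   # local hash table: service name -> set of addresses

    def replace_routes(self, table):
        self.routes = {service: set(addrs) for service, addrs in table.items()}

class Discovery:
    """A broker: remembers recent heartbeats and pushes routes to clients."""

    def __init__(self, ttl_seconds=3.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.last_seen = {}      # (service, address) -> time of last heartbeat
        self.subscribers = []

    def subscribe(self, client):
        self.subscribers.append(client)
        self._push()

    def heartbeat(self, service, address):
        self.last_seen[(service, address)] = self.clock()
        self._push()

    def expire(self):
        # Drop nodes whose last heartbeat is older than the TTL.
        now = self.clock()
        self.last_seen = {key: seen for key, seen in self.last_seen.items()
                          if now - seen <= self.ttl}
        self._push()

    def _push(self):
        table = {}
        for service, address in self.last_seen:
            table.setdefault(service, set()).add(address)
        for client in self.subscribers:
            client.replace_routes(table)

now = [0.0]                                   # fake clock for the demo
discovery = Discovery(ttl_seconds=3.0, clock=lambda: now[0])
client = Client()
discovery.subscribe(client)

discovery.heartbeat("users", "10.0.0.1:8080")
discovery.heartbeat("users", "10.0.0.2:8080")
now[0] = 2.0
discovery.heartbeat("users", "10.0.0.1:8080")  # only the first node keeps beating
now[0] = 4.0
discovery.expire()                             # the second node has gone silent

discovery = None                               # Discovery goes away...
print(client.routes)                           # ...the client still has its table
```

Because the client holds its own copy of the table, losing Discovery means routes stop updating, not that traffic stops flowing.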
Smart Endpoints Architecture
[Diagram: servers labeled with versions 1.1, 1.0, 1.0, 1.0]
In the Smart Endpoint model, when a Client talks to our buggy service and fails due to a software bug, it blacklists the node!
Smart Endpoints Architecture
Failure is detected QUICKLY. A tiny amount of traffic will still fail, but compared to the LB model it’s a tiny, tiny amount.
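A back-of-the-envelope simulation shows the size of the difference. The numbers are assumptions chosen for illustration: four nodes with one buggy, 1000 requests, round-robin routing, and a load balancer whose health check never notices the bug.

```python
def buggy_node(request):
    raise ValueError("faulty payload")

def good_node(request):
    return "ok"

NODES = [buggy_node, good_node, good_node, good_node]
REQUESTS = 1000

def failures_behind_load_balancer():
    """Round-robin LB that keeps routing to the buggy node."""
    failed = 0
    for i in range(REQUESTS):
        node = NODES[i % len(NODES)]
        try:
            node(i)
        except ValueError:
            failed += 1
    return failed

def failures_with_smart_endpoint():
    """Client that blacklists a node the first time it misbehaves."""
    blacklisted = set()
    failed = 0
    turn = 0
    for i in range(REQUESTS):
        usable = [n for n in range(len(NODES)) if n not in blacklisted]
        index = usable[turn % len(usable)]
        turn += 1
        try:
            NODES[index](i)
        except ValueError:
            failed += 1
            blacklisted.add(index)
    return failed

lb_failures = failures_behind_load_balancer()
smart_failures = failures_with_smart_endpoint()
print(lb_failures, smart_failures)   # 250 1
```

Under these assumptions a quarter of all requests fail behind the load balancer, while the smart endpoint loses only the single request that discovered the bug.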
It’s mostly about Circuit Breakers...
● Smart Endpoints as an architecture works because of a tech called Circuit Breakers.
● Nodes independently track their usage of remote services. When a node encounters a failure due to software or infrastructure, it blacklists the remote service.
● It’s important to understand circuit breakers are local and not global. Each service in your system might have a different concept of “working”.
Circuit Breakers are really powerful...
● Circuit breakers provide safety from both infrastructure AND software issues
● Timeouts, network partitions, and server failures are all transient failures that come and go. A node can be temporarily blacklisted when a failure is due to an infrastructure issue.
● A software bug, such as our aforementioned integration bug on the remote service, will not go away on its own. That node can be permanently blacklisted.
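The temporary versus permanent distinction might look something like the sketch below. It is deliberately simplified: the class and exception names, the `cooldown_seconds` parameter, and the fake clock are assumptions for illustration, and it leaves out the closed/open/half-open state machine found in many circuit breaker implementations.

```python
import time

class TransientFailure(Exception):
    """Timeout, network partition, server failure (hypothetical)."""

class IntegrationBug(Exception):
    """The node answered, but with something this client cannot use (hypothetical)."""

class CircuitBreaker:
    """Local, per-client record of which nodes are currently usable."""

    def __init__(self, cooldown_seconds=30.0, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.retry_at = {}   # address -> time after which the node may be retried
        self.banned = set()  # addresses blacklisted permanently

    def available(self, address):
        if address in self.banned:
            return False
        retry_at = self.retry_at.get(address)
        return retry_at is None or self.clock() >= retry_at

    def record_failure(self, address, error):
        if isinstance(error, IntegrationBug):
            # Retrying will not help: blacklist permanently.
            self.banned.add(address)
        else:
            # Infrastructure trouble comes and goes: back off for a while.
            self.retry_at[address] = self.clock() + self.cooldown

    def record_success(self, address):
        self.retry_at.pop(address, None)

now = [0.0]   # fake clock for the demo
breaker = CircuitBreaker(cooldown_seconds=30.0, clock=lambda: now[0])
breaker.record_failure("node-a", TransientFailure("timeout"))
breaker.record_failure("node-b", IntegrationBug("unparseable JSON"))

during = (breaker.available("node-a"), breaker.available("node-b"))
now[0] = 31.0
after = (breaker.available("node-a"), breaker.available("node-b"))
print(during, after)   # (False, False) (True, False)
```

Each client would keep its own breaker, which matches the point that circuit breakers are local and not global.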
Smart Endpoints In a Nutshell
Two big but very simple things!
1. Each service maintains its own record of addresses in the environment. A service can in theory talk to any other service.
2. Circuit breakers prevent catastrophic failures by blacklisting misbehaving nodes. Blacklisting is done on a per-node basis.
Smart Endpoints Advantages...
● Smart Endpoints allow us to do quick integration testing because we don’t have to worry about catastrophic cascade failure (blacklist the misbehaving node and talk to known working nodes).
● Smart Endpoints make integration bugs far less dangerous and therefore enable faster development cycles.
● Still prevent classic infrastructure issues from being the downfall of your app.
● Do you see where this is going?
Back to Borg-like Architecture...
● If Smart endpoints prevent catastrophic software failure then the following things become true:
○ Change becomes a way of life. Modifying the system with new services becomes something developers, management and operations are comfortable with.
○ Change means new technology can be integrated, features added, bugs fixed and security holes patched without fear that a million or billion dollar line-of-business application fails and costs the company huge $$$.
○ The end result is an application that is in a continual state of improvement.
To learn more
Contact me:
● [email protected]
● Twitter: @TheBigLombowski
Jobs
● https://www.datawire.io/careers/
○ Java, Python, Go and Kubernetes Developers Wanted!
Try
● https://github.com/datawire/mdk
● https://www.datawire.io