Resiliency, Failures vs Errors,
Isolation, Delegation
and Replication
in Reactive Systems
Jonas Bonér CTO TypEsafe
@jboner
“But it ain’t how hard you’re hit; it’s about how hard you can get hit, and keep moving forward. How much you can take, and keep moving forward. That’s
how winning is done.”- Rocky Balboa
“But it ain’t how hard you’re hit; it’s about how hard you can get hit, and keep moving forward. How much you can take, and keep moving forward. That’s
how winning is done.”- Rocky Balboa
This is Fault Tolerance
Resilience
is Beyond
Fault Tolerance
Resilience
“The ability of a substance or object to spring back into shape. The capacity to recover quickly
from difficulties.”-Merriam Webster
Without Resilience
Nothing Else Matters
Without Resilience
Nothing Else Matters
“We can model and understand in isolation. But, when released into competitive nominally
regulated societies, their connections proliferate, their interactions and interdependencies multiply,
their complexities mushroom. And we are caught short.”
- Sidney Dekker
Drift into Failure - Sidney Dekker
We Need to Study
Resilience in
Complex Systems
Complicated System
Complicated System
Complex System
Complex System
Complex System
Complicated ≠ Complex
“Counterintuitive. That’s [Jay] Forrester’s word to describe complex systems. Leverage points are not intuitive. Or if they are, we intuitively
use them backward, systematically worsening whatever problems we are trying to solve.”
- Donella Meadows
Leverage Points: Places to Intervene in a System - Donella Meadows
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Operating Point
Accident Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
FAILURE
Accident Boundary
Operating at the Edge of Failure
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Management Pressure
Towards Economic Efficiency
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Management Pressure
Towards Economic Efficiency
Gradient Towards Least Effort
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Management Pressure
Towards Economic Efficiency
Gradient Towards Least Effort
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Management Pressure
Towards Economic Efficiency
Gradient Towards Least Effort
Counter Gradient For More Resilience
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Error Margin
Marginal Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Error Margin
Marginal Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Economic Failure
Boundary
Unacceptable Workload Boundary
Accident Boundary
Error Margin
Marginal Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Accident Boundary
Marginal Boundary
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Marginal Boundary
?
Operating at the Edge of Failure
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Accident Boundary
Marginal Boundary
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Accident Boundary
Marginal Boundary
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Accident Boundary
Marginal Boundary
‘‘Going solid’’: a model of system dynamics and consequences for patient safety - R Cook, J Rasmussen Resilience in complex adaptive systems: Operating at the Edge of Failure - Richard Cook - Talk at Velocity NY 2013
Operating at the Edge of Failure
Accident Boundary
Marginal Boundary
Embrace Failure
Resilience in
Social Systems
Simple Critical Infrastructure Map
Understanding vital services, and how they keep you safe
6 ways to die
3 sets of essential services
7 layers of PROTECTION
Dealing in Security - Mike Bennet, Vinay Gupta
7 Principles for Building
Resilience in Social Systems
1. Maintain diversity & Redundancy 2. Manage connectivity 3. Manage slow variables & feedback 4. Foster complex adaptive systems thinking 5. Encourage learning 6. Broaden participation 7. Promote polycentric governance
Principles for Building Resilience: Sustaining Ecosystem Services in Social-Ecological Systems - Reinette Biggs et. al.
Resilience in
Biological Systems
MeerkatsPuppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk
What We Can Learn
From Biological Systems
1. Feature Diversity and redundancy 2. Inter-Connected network structure 3. Wide distribution across all scales 4. Capacity to self-adapt & self-organize
Toward Resilient Architectures 1: Biology Lessons - Michael Mehaffy, Nikos A. Salingaros
“Animals show extraordinary social complexity, and this allows them to adapt and
respond to changes in their environment. In three words, in the animal kingdom,
simplicity leads to complexity which leads to resilience.”
- Nicolas Perony
Puppies! Now that I’ve got your attention, complexity theory - Nicolas Perony, TED talk
Resilience in
Computer Systems
“Complex systems run in degraded mode.” “Complex systems run as broken systems.”
- richard Cook
How Complex Systems Fail - Richard Cook
Resilience
is by
Design
Photo courtesy of FEMA/Joselyne Augustino
We Need to
Manage Failure
“Post-accident attribution to a ‘root cause’ is fundamentally wrong:
Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident.”
- richard Cook
How Complex Systems Fail - Richard Cook
There is No
Root Cause
Crash Only
Software
Crash-Only Software - George Candea, Armando Fox
Stop = Crash Safely Start = Recover Fast
“To make a system of interconnected components crash-only, it must be designed so that components
can tolerate the crashes and temporary unavailability of their peers. This means we require: [1] strong
modularity with relatively impermeable component boundaries, [2] timeout-based communication and
lease-based resource allocation, and [3] self-describing requests that carry a time-to-live and
information on whether they are idempotent.”- George Candea, Armando Fox
Crash-Only Software - George Candea, Armando Fox
Recursive Restartability
Turning the Crash-Only Sledgehammer into a Scalpel
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox
Services need to accept
NO for an answer
"Software components should be designed such that they can deny service for any request or call.
Then, if an underlying component can say No, apps must be designed to take No for an answer
and decide how to proceed: give up, wait and retry, reduce fidelity, etc.”
- George Candea, Armando Fox
Recursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - George Candea, Armando Fox
Learn to take
NO for an answer
“The explosive growth of software has added greatly to systems’ interactive
complexity. With software, the possible states that a system can end up in become
mind-boggling.”
- Sidney Dekker
Drift into Failure - Sidney Dekker
Classification
of State
• Static Data • Scratch Data • Dynamic Data
• Recomputable • not recomputable
Classification
of State
• Static Data • Scratch Data • Dynamic Data
• Recomputable • not recomputable
Critical
We Need a
Way Out of the
State Tar Pit
Out of the Tar Pit - Ben Moseley , Peter Marks
Essential State
We Need a
Way Out of the
State Tar Pit
Out of the Tar Pit - Ben Moseley , Peter Marks
Essential State
We Need a
Way Out of the
State Tar Pit
Out of the Tar Pit - Ben Moseley , Peter Marks
Essential Logic
Essential State
We Need a
Way Out of the
State Tar Pit
Out of the Tar Pit - Ben Moseley , Peter Marks
Essential Logic
Accidental State and Control
Traditional
State Management
Object
Critical state that needs protection
Client
Thread boundary
Traditional
State Management
Object
Critical state that needs protection
Client
Thread boundary
Traditional
State Management
Object
Critical state that needs protection
Client
Thread boundary
Traditional
State Management
Object
Critical state that needs protection
Client
Thread boundary
Synchronous dispatch Thread boundary
Traditional
State Management
Object
Critical state that needs protection
Client
Thread boundary
Synchronous dispatch Thread boundary
Traditional
State Management
Object
Critical state that needs protection
Client
Thread boundary
Synchronous dispatch Thread boundary
?
Traditional
State Management
Object
Critical state that needs protection
Client
Thread boundary
Synchronous dispatch Thread boundary
?
Utterly broken
Requirements for a
Sane Failure Mode
1. Contained 2. Reified—as messages 3. Signalled—Asynchronously 4. Observed—by 1-N 5. Managed
Failures need to be
Akka Actors
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
Define the message(s) the Actor should be able to respond to
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
Define the message(s) the Actor should be able to respond to
Define the Actor class
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
Define the message(s) the Actor should be able to respond to
Define the Actor class
Define the Actor’s behavior
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
Create an Actor system
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
Create an Actor systemActor configuration
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
Give it a name
Create an Actor systemActor configuration
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
Give it a nameCreate the Actor
Create an Actor systemActor configuration
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
Give it a nameCreate the ActorYou get an ActorRef back
Create an Actor systemActor configuration
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
greeter ! Greeting("Charlie Parker")
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
greeter ! Greeting("Charlie Parker")
Send the message asynchronously
Akka Actors
case class Greeting(who: String)
class GreetingActor extends Actor with ActorLogging { def receive = { case Greeting(who) => log.info(s”Hello ${who}") } }
val system = ActorSystem("MySystem") val greeter = system.actorOf(Props[GreetingActor], "greeter")
greeter ! Greeting("Charlie Parker")
Bulkhead
Pattern
Bulkhead
Pattern
Bulkhead
Pattern
Enter Supervision
Enter Supervision
The
Vending
Machine
Pattern
Think Vending Machine
Coffee Machine
Programmer
Think Vending Machine
Coffee Machine
Programmer
Inserts coins
Think Vending Machine
Coffee Machine
Programmer
Inserts coins
Add more coins
Think Vending Machine
Coffee Machine
Programmer
Inserts coins
Gets coffee
Add more coins
Think Vending Machine
ProgrammerCoffee
Machine
Think Vending Machine
Programmer
Inserts coins
Coffee Machine
Think Vending Machine
Programmer
Inserts coins
Out of coffee beans error Coffee Machine
Think Vending Machine
Programmer
Inserts coins
Out of coffee beans error
WRONGCoffee
Machine
Think Vending Machine
Programmer
Inserts coins
Coffee Machine
Think Vending Machine
Programmer
Inserts coins
Out of coffee beans
failure
Coffee Machine
Think Vending Machine
Programmer
Service Guy
Inserts coins
Out of coffee beans
failure
Coffee Machine
Think Vending Machine
Programmer
Service Guy
Inserts coins
Out of coffee beans
failure
Adds more beans
Coffee Machine
Think Vending Machine
Programmer
Service Guy
Inserts coins
Gets coffee
Out of coffee beans
failure
Adds more beans
Coffee Machine
Think Vending Machine
ServiceClient
Think Vending Machine
ServiceClient
Request
Think Vending Machine
ServiceClient
Request
Response
Think Vending Machine
ServiceClient
Request
Response
Validation Error
Think Vending Machine
ServiceClient
Request
Response
Validation Error
Application Failure
Think Vending Machine
ServiceClient
Supervisor
Request
Response
Validation Error
Application Failure
Think Vending Machine
ServiceClient
Supervisor
Request
Response
Validation Error
Application Failure
Manages Failure
“Accidents come from relationships not broken parts.”
- Sidney dekker
Drift into Failure - Sidney Dekker
Error Kernel
Pattern
Onion-layered state & Failure management
Making reliable distributed systems in the presence of software errors - Joe Armstrong On Erlang, State and Crashes - Jesper Louis Andersen
Onion Layered
State Management
Object
Critical state that needs protection
Client
Thread boundary
Onion Layered
State Management
Object
Critical state that needs protection
Client
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Onion Layered
State Management
Error Kernel
Object
Critical state that needs protection
Client
Supervision
Supervision
Thread boundary
Supervision in Akka
Every actor has a default supervisor strategy. Which can, and often should, be overridden.
class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate }
val worker = context.actorOf(Props[Worker], name = "worker")
def receive = { case number: Int => worker.forward(number) } }
Supervision in Akka
Every actor has a default supervisor strategy. Which can, and often should, be overridden.
class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate }
val worker = context.actorOf(Props[Worker], name = "worker")
def receive = { case number: Int => worker.forward(number) } }
Parent actor
Supervision in Akka
Every actor has a default supervisor strategy. Which can, and often should, be overridden.
class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate }
val worker = context.actorOf(Props[Worker], name = "worker")
def receive = { case number: Int => worker.forward(number) } }
Parent actor
All its children have their life-cycle managed through this declarative supervision strategy
Supervision in Akka
Every actor has a default supervisor strategy. Which can, and often should, be overridden.
class Supervisor extends Actor { override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) { case _: ArithmeticException => Resume case _: NullPointerException => Restart case _: Exception => Escalate }
val worker = context.actorOf(Props[Worker], name = "worker")
def receive = { case number: Int => worker.forward(number) } }
Create a supervised child actor
Parent actor
All its children have their life-cycle managed through this declarative supervision strategy
Monitor through
Death Watch
class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child)
def receive = { case Terminated(`child`) => … // handle child termination } }
Monitor through
Death Watch
class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child)
def receive = { case Terminated(`child`) => … // handle child termination } }
Create a child actor
Monitor through
Death Watch
class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child)
def receive = { case Terminated(`child`) => … // handle child termination } }
Create a child actor
Watch it
Monitor through
Death Watch
class Watcher extends Actor { val child = context.actorOf(Props.empty, "child") context.watch(child)
def receive = { case Terminated(`child`) => … // handle child termination } }
Create a child actor
Watch it
Receive termination signal
Maintain Diversity
and Redundancy
Maintain Diversity
and Redundancy
Akka Routing
akka { actor { deployment { /service/router { router = round-robin-pool resizer { lower-bound = 12 upper-bound = 15 } } } } }
Akka Cluster
akka { actor { deployment { /service/router { router = round-robin-pool resizer { lower-bound = 12 upper-bound = 15 } } }
provider = "akka.cluster.ClusterActorRefProvider" } cluster { seed-nodes = [ “akka.tcp://[email protected]:2551", “akka.tcp://[email protected]:2552" ] } }
The Network
is Reliable
The Network
is Reliable
NAT
We are living in the Looming
Shadow of
Impossibility
Theorems
We are living in the Looming
Shadow of
Impossibility
Theorems
CAP: Consistency is impossible
We are living in the Looming
Shadow of
Impossibility
Theorems
CAP: Consistency is impossibleFLP: Consensus is impossible
Strong
Consistency
Is the wrong default
We need Decoupling IN
Time and Space
Resilient Protocols
Depend on Asynchronous Communication
Eventual Consistency
Resilient Protocols
• are tolerant to• Message loss• Message reordering• Message duplication
Depend on Asynchronous Communication
Eventual Consistency
Resilient Protocols
• are tolerant to• Message loss• Message reordering• Message duplication
• Embrace ACID 2.0• Associative
• Commutative
• Idempotent
• Distributed
Depend on Asynchronous Communication
Eventual Consistency
Akka Distributed Data
class DataBot extends Actor with ActorLogging { val replicator = DistributedData(context.system).replicator implicit val node = Cluster(context.system)
val tickTask = context.system.scheduler.schedule( 5.seconds, 5.seconds, self, Add)
val DataKey = ORSetKey[String]("key") replicator ! Subscribe(DataKey, self)
def receive = {
Akka Distributed Data
class DataBot extends Actor with ActorLogging { val replicator = DistributedData(context.system).replicator implicit val node = Cluster(context.system)
val tickTask = context.system.scheduler.schedule( 5.seconds, 5.seconds, self, Add)
val DataKey = ORSetKey[String]("key") replicator ! Subscribe(DataKey, self)
def receive = { case Tick => val s = ThreadLocalRandom.current().nextInt(97, 123).toChar.toString if (ThreadLocalRandom.current().nextBoolean()) replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ + s) else replicator ! Update(DataKey, ORSet.empty[String], WriteLocal)(_ - s) case c @ Changed(DataKey) => val data = c.get(DataKey)
case _: UpdateResponse[_] => // ignore } }
“An escalator can never break: it can only become stairs. You should never see an
Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the
convenience.”- Mitch Hedberg
Graceful
Degradation
Always Rely on Asynchronous
Communication
Always Rely on Asynchronous
Communication
…And If not pOssiblE Always use
Timeouts
Circuit Breaker
Circuit Breaker in Akka
val breaker = new CircuitBreaker( context.system.scheduler, maxFailures = 5, callTimeout = 10.seconds, resetTimeout = 1.minute ).onOpen(…) .onHalfOpen(…)
val result = breaker.withCircuitBreaker(Future(dangerousCall()))
Little’s Law
������
W: Response Time
L: Queue Length
Little’s Law
������
Queue Length = Arrival Rate * Response Time
W: Response Time
L: Queue Length
Little’s Law
�������
Response Time = Queue Length / Arrival Rate
W: Response Time
L: Queue Length
Flow Control
Flow Control
Always Apply Back Pressure
Akka Streams
Akka Streams
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._
val in = Source(1 to 10) val out = Sink.ignore
val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2))
val f1, f2, f3, f4 = Flow[Int].map(_ + 10)
}
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._
val in = Source(1 to 10) val out = Sink.ignore
val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2))
val f1, f2, f3, f4 = Flow[Int].map(_ + 10)
}
Set up the context
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._
val in = Source(1 to 10) val out = Sink.ignore
val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2))
val f1, f2, f3, f4 = Flow[Int].map(_ + 10)
}
Set up the context
Create the Source and Sink
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._
val in = Source(1 to 10) val out = Sink.ignore
val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2))
val f1, f2, f3, f4 = Flow[Int].map(_ + 10)
}
Set up the context
Create the Source and Sink
Create the fan out and fan in stages
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._
val in = Source(1 to 10) val out = Sink.ignore
val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2))
val f1, f2, f3, f4 = Flow[Int].map(_ + 10)
}
Set up the context
Create the Source and Sink
Create the fan out and fan in stages
Create a set of transformations
Akka Streams
in ~> f1 ~> bcast ~> f2 ~> merge ~> f3 ~> out bcast ~> f4 ~> merge
val g = FlowGraph.closed() { implicit builder: FlowGraph.Builder => import FlowGraph.Implicits._
val in = Source(1 to 10) val out = Sink.ignore
val bcast = builder.add(Broadcast[Int](2)) val merge = builder.add(Merge[Int](2))
val f1, f2, f3, f4 = Flow[Int].map(_ + 10)
}
Set up the context
Create the Source and Sink
Create the fan out and fan in stages
Create a set of transformations
Define the graph processing blueprint
Testing
What can we learn from Arnold?
What can we learn from Arnold?
What can we learn from Arnold?
Blow things up
Shoot
Your App
Down
Pull the Plug
…and see what happens
Executive
Summary
“Complex systems run in degraded mode.” “Complex systems run as broken systems.”
- richard Cook
How Complex Systems Fail - Richard Cook
Resilience
is by
Design
Photo courtesy of FEMA/Joselyne Augustino
ReferencesAntifragile: Things That Gain from Disorder - http://www.amazon.com/Antifragile-Things-that-Gain-Disorder-ebook/dp/B009K6DKTS Drift into Failure - http://www.amazon.com/Drift-into-Failure-Components-Understanding-ebook/dp/B009KOKXKYHow Complex Systems Fail - http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdfLeverage Points: Places to Intervene in a System - http://www.donellameadows.org/archives/leverage-points-places-to-intervene-in-a-system/ Going Solid: A Model of System Dynamics and Consequences for Patient Safety - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1743994/Resilience in Complex Adaptive Systems: Operating at the Edge of Failure - https://www.youtube.com/watch?v=PGLYEDpNu60Dealing in Security - http://resiliencemaps.org/files/Dealing_in_Security.July2010.en.pdfPrinciples for Building Resilience: Sustaining Ecosystem Services in Social-Ecological Systems - http://www.amazon.com/Principles-Building-Resilience-Sustaining-Social-Ecological/dp/110708265X Puppies! Now that I’ve got your attention, Complexity Theory - https://www.ted.com/talks/nicolas_perony_puppies_now_that_i_ve_got_your_attention_complexity_theoryHow Bacteria Becomes Resistant - http://www.abc.net.au/science/slab/antibiotics/resistance.htmTowards Resilient Architectures: Biology Lessons - http://www.metropolismag.com/Point-of-View/March-2013/Toward-Resilient-Architectures-1-Biology-Lessons/Crash-Only Software - https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea.pdfRecursive Restartability: Turning the Reboot Sledgehammer into a Scalpel - http://roc.cs.berkeley.edu/papers/recursive_restartability.pdfOut of the Tar Pit - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928Bulkhead Pattern - http://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.htmlMaking Reliable Distributed Systems in the Presence of Software Errors - http://www.erlang.org/download/armstrong_thesis_2003.pdfOn Erlang, State and Crashes - http://jlouisramblings.blogspot.be/2010/11/on-erlang-state-and-crashes.htmlAkka Supervision - http://doc.akka.io/docs/akka/snapshot/general/supervision.htmlRelease It!: Design and Deploy Production-Ready Software - https://pragprog.com/book/mnee/release-itHystrix - https://github.com/Netflix/HystrixAkka Circuit Breaker - http://doc.akka.io/docs/akka/snapshot/common/circuitbreaker.html Reactive Streams - http://reactive-streams.orgAkka Streams - http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-introduction.htmlRxJava - https://github.com/ReactiveX/RxJavaFeedback Control for Computer Systems - http://www.amazon.com/Feedback-Control-Computer-Systems-Philipp/dp/1449361692Simian Army - https://github.com/Netflix/SimianArmyGatling - http://gatling.ioAkka MultiNode Testing - http://doc.akka.io/docs/akka/snapshot/dev/multi-node-testing.html
EXPERT TRAINING Delivered On-site For Spark, Scala, Akka And Play
Help is just a click away. Get in touch with Typesafe about our training courses.
• Intro Workshop to Apache Spark • Fast Track & Advanced Scala • Fast Track to Akka with Java or Scala • Fast Track to Play with Java or Scala • Advanced Akka with Java or Scala
Ask us about local trainings available by 24 Typesafe partners in 14 countries around the world.
CONTACT US Learn more about on-site training
Thank
You