Fault Tolerance in a High Volume, Distributed System

Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen

Post on 11-May-2015

DESCRIPTION

More information can be found at http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

TRANSCRIPT

Page 2: Fault Tolerance in a  High Volume, Distributed System

Dozens of dependencies.

One going down takes everything down.

99.99%^30 = 99.7% uptime
0.3% of 1 billion = 3,000,000 failures

2+ hours downtime/month even if all dependencies have excellent uptime.

Reality is generally worse.
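The arithmetic on this slide can be reproduced in a few lines. This is only a sketch: the class and method names are made up for illustration, the dependencies are assumed to fail independently, and the 30-day month used for the downtime figure is an assumption the slide does not state.

```java
final class CompoundAvailability {
    /** Availability of a request that needs every one of n independent dependencies. */
    static double combined(double perDependency, int dependencies) {
        return Math.pow(perDependency, dependencies);
    }

    public static void main(String[] args) {
        double uptime = combined(0.9999, 30);                 // 30 dependencies at 99.99% each
        double failureRate = 1.0 - uptime;
        long failuresPerBillion = Math.round(failureRate * 1_000_000_000L);
        double downtimeHoursPerMonth = failureRate * 30 * 24; // assumes a 30-day month

        System.out.printf("combined uptime: %.2f%%%n", uptime * 100);
        System.out.printf("failures per 1 billion requests: %,d%n", failuresPerBillion);
        System.out.printf("downtime per month: %.1f hours%n", downtimeHoursPerMonth);
    }
}
```

The unrounded failure count comes out slightly under 3,000,000 because the slide rounds the uptime to 99.7%.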

Page 6: Fault Tolerance in a  High Volume, Distributed System

No single dependency should take down the entire app.

Fail fast.
Fail silent.
Fallback.

Shed load.

Page 7: Fault Tolerance in a  High Volume, Distributed System

Options

Aggressive Network Timeouts

Semaphores (Tryable)

Separate Threads

Circuit Breaker
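The deck shows code for semaphores, threads, and the circuit breaker, but none for the first option. As a rough sketch of aggressive network timeouts, the following uses the JDK's java.net.http client (Java 11+). The class name, the 250 ms and 1 s budgets, and the URL are illustrative assumptions, not values from the talk.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class AggressiveTimeouts {
    // Hypothetical budgets; in practice they would be derived from observed latency.
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(250);
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(1);

    static HttpClient newClient() {
        return HttpClient.newBuilder().connectTimeout(CONNECT_TIMEOUT).build();
    }

    static HttpRequest newRequest(String url) {
        // If no response arrives within REQUEST_TIMEOUT, send() throws HttpTimeoutException.
        return HttpRequest.newBuilder(URI.create(url)).timeout(REQUEST_TIMEOUT).GET().build();
    }

    public static void main(String[] args) {
        HttpClient client = newClient();
        HttpRequest request = newRequest("http://localhost:8080/recommendations");
        System.out.println("connect timeout: " + client.connectTimeout().orElseThrow());
        System.out.println("request timeout: " + request.timeout().orElseThrow());
    }
}
```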

Page 10: Fault Tolerance in a  High Volume, Distributed System

TryableSemaphore executionSemaphore = getExecutionSemaphore();
// acquire a permit
if (executionSemaphore.tryAcquire()) {
    try {
        return executeCommand();
    } finally {
        executionSemaphore.release();
    }
} else {
    circuitBreaker.markSemaphoreRejection();
    // permit not available so return fallback
    return getFallback();
}

Semaphores (Tryable): Limited Concurrency
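The slide's snippet depends on Netflix-internal types. A self-contained approximation of the same pattern, using java.util.concurrent.Semaphore in place of TryableSemaphore, might look like the sketch below. SemaphoreGuard and its method names are hypothetical, and the circuit breaker bookkeeping is left out.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreGuard {
    private final Semaphore permits;

    SemaphoreGuard(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the command if a permit is free; otherwise returns the fallback without blocking. */
    <T> T execute(Supplier<T> command, Supplier<T> fallback) {
        if (permits.tryAcquire()) {
            try {
                return command.get();
            } finally {
                permits.release();
            }
        }
        return fallback.get(); // load is shed instead of queued
    }

    public static void main(String[] args) {
        SemaphoreGuard guard = new SemaphoreGuard(1);
        // The nested call runs while the only permit is held, so it takes the fallback.
        String result = guard.execute(
                () -> "outer saw inner=" + guard.execute(() -> "live", () -> "fallback"),
                () -> "fallback");
        System.out.println(result); // prints "outer saw inner=fallback"
    }
}
```

Because tryAcquire() never blocks, a saturated dependency costs the caller only the fallback, not a waiting thread.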

Page 14: Fault Tolerance in a  High Volume, Distributed System

try {
    if (!threadPool.isQueueSpaceAvailable()) {
        // we are at the property defined max so want to throw the RejectedExecutionException to simulate
        // reaching the real max and go through the same codepath and behavior
        throw new RejectedExecutionException("Rejected command because thread-pool queueSize is at rejection threshold.");
    }

    ... define Callable that performs executeCommand() ...

    // submit the work to the thread-pool
    return threadPool.submit(command);
} catch (RejectedExecutionException e) {
    circuitBreaker.markThreadPoolRejection();
    // rejected so return fallback
    return getFallback();
}

Separate Threads: Limited Concurrency
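A runnable approximation of this slide, assuming a plain ThreadPoolExecutor with a bounded queue stands in for the Netflix thread-pool wrapper. IsolatedDependency is a hypothetical name, and the pool sizes (one worker, one queue slot) are chosen only to make the rejection easy to see.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class IsolatedDependency {
    // One worker and one queue slot. The default AbortPolicy throws
    // RejectedExecutionException once both are occupied.
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));

    /** Submits the command, or returns an already-completed fallback if the pool is saturated. */
    <T> Future<T> submit(Callable<T> command, T fallback) {
        try {
            return pool.submit(command);
        } catch (RejectedExecutionException e) {
            return CompletableFuture.completedFuture(fallback);
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        IsolatedDependency dependency = new IsolatedDependency();
        CountDownLatch release = new CountDownLatch(1);
        Callable<String> slow = () -> { release.await(); return "live"; };

        Future<String> first = dependency.submit(slow, "fallback");  // occupies the worker
        Future<String> second = dependency.submit(slow, "fallback"); // waits in the queue
        Future<String> third = dependency.submit(slow, "fallback");  // rejected

        System.out.println(third.get());                    // prints "fallback"
        release.countDown();
        System.out.println(first.get() + " " + second.get()); // prints "live live"
        dependency.shutdown();
    }
}
```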

Page 17: Fault Tolerance in a  High Volume, Distributed System

Separate Threads: Timeout

public K get() throws CancellationException, InterruptedException, ExecutionException {
    try {
        long timeout = getCircuitBreaker().getCommandTimeoutInMilliseconds();
        return get(timeout, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        // report timeout failure
        circuitBreaker.markTimeout(System.currentTimeMillis() - startTime);

        // retrieve the fallback
        return getFallback();
    }
}

Override of Future.get()
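The same idea can be tried without subclassing Future. The sketch below wraps Future.get(timeout, unit) in a hypothetical helper, TimedCall.getOrFallback. The cancel(true) call is an addition that the slide does not show; it is there so the worker thread is interrupted rather than left running.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedCall {
    /** Waits at most timeoutMillis for the result, then cancels the task and returns the fallback. */
    static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback)
            throws InterruptedException, ExecutionException {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker: give up and move on
            return fallback;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> slow = pool.submit(() -> { Thread.sleep(2_000); return "live"; });
            System.out.println(getOrFallback(slow, 100, "fallback")); // prints "fallback"
        } finally {
            pool.shutdownNow();
        }
    }
}
```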

Page 21: Fault Tolerance in a  High Volume, Distributed System

if (circuitBreaker.allowRequest()) {
    return executeCommand();
} else {
    // short-circuit and go directly to fallback
    circuitBreaker.markShortCircuited();
    return getFallback();
}

Circuit Breaker
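The slide shows only the call site, not how allowRequest() decides. One simplified way such a breaker could work is sketched below. It is not the Netflix implementation: it trips on a count of consecutive failures rather than an error percentage, and the class name, thresholds, and injected clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures that open the circuit
    private final long sleepWindowMillis; // how long to reject before allowing a trial request
    private final LongSupplier clock;     // injected so the behaviour can be shown without sleeping
    private int consecutiveFailures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean allowRequest() {
        if (!open) return true;                       // closed: let traffic through
        long now = clock.getAsLong();
        if (now - openedAt >= sleepWindowMillis) {
            openedAt = now;                           // allow one trial and restart the window
            return true;
        }
        return false;                                 // open: short-circuit to the fallback
    }

    synchronized void markSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    synchronized void markFailure() {
        consecutiveFailures++;
        if (!open && consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        long[] now = {0}; // fake clock in milliseconds
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 5_000, () -> now[0]);
        for (int i = 0; i < 3; i++) breaker.markFailure();
        System.out.println(breaker.allowRequest()); // false: circuit is open
        now[0] = 5_000;
        System.out.println(breaker.allowRequest()); // true: one trial after the sleep window
        System.out.println(breaker.allowRequest()); // false: still open until the trial succeeds
        breaker.markSuccess();
        System.out.println(breaker.allowRequest()); // true: circuit closed again
    }
}
```

In real use the clock would be System::currentTimeMillis.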

Page 24: Fault Tolerance in a  High Volume, Distributed System

Netflix uses all 4 in combination

Page 26: Fault Tolerance in a  High Volume, Distributed System

Tryable semaphores for “trusted” clients and fallbacks

Separate threads for “untrusted” clients

Aggressive timeouts on threads and network calls to “give up and move on”

Circuit breakers as the “release valve”

Page 30: Fault Tolerance in a  High Volume, Distributed System

Benefits of Separate Threads

Protection from client libraries

Lower risk to accept new/updated clients

Quick recovery from failure

Client misconfiguration

Client service performance characteristic changes

Built-in concurrency

Page 31: Fault Tolerance in a  High Volume, Distributed System

Drawbacks of Separate Threads

Some computational overhead

Load on machine can be pushed too far

...

Benefits outweigh drawbacks when clients are “untrusted”

Page 33: Fault Tolerance in a  High Volume, Distributed System

Visualizing Circuits in Realtime (generally sub-second latency)

Video available at https://vimeo.com/33576628

Page 34: Fault Tolerance in a  High Volume, Distributed System

Rolling 10 second counter – 1 second granularity

Median Mean 90th 99th 99.5th

Latent Error Timeout Rejected

Error Percentage = (error + timeout + rejected) / (success + latent success + error + timeout + rejected)
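A rolling counter with these dimensions (ten 1-second buckets) and the error percentage formula above can be sketched as follows. This is an illustrative approximation, not the production code: RollingErrorRate, its Event enum, and the injected millisecond clock are assumed names, and the clock is assumed to be non-negative.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

final class RollingErrorRate {
    enum Event { SUCCESS, LATENT_SUCCESS, ERROR, TIMEOUT, REJECTED }

    private static final int BUCKETS = 10;           // 10-second window
    private static final long BUCKET_MILLIS = 1_000; // 1-second granularity

    private final long[][] counts = new long[BUCKETS][Event.values().length];
    private final long[] bucketSecond = new long[BUCKETS]; // which second each slot holds
    private final LongSupplier clock;

    RollingErrorRate(LongSupplier clock) {
        this.clock = clock;
        Arrays.fill(bucketSecond, -1); // -1 marks a slot that has never been used
    }

    synchronized void record(Event event) {
        long second = clock.getAsLong() / BUCKET_MILLIS;
        int slot = (int) (second % BUCKETS);
        if (bucketSecond[slot] != second) { // slot holds an expired second: reuse it
            Arrays.fill(counts[slot], 0);
            bucketSecond[slot] = second;
        }
        counts[slot][event.ordinal()]++;
    }

    /** (error + timeout + rejected) / (all events), as a percentage over the last 10 seconds. */
    synchronized double errorPercentage() {
        long second = clock.getAsLong() / BUCKET_MILLIS;
        long failed = 0;
        long total = 0;
        for (int slot = 0; slot < BUCKETS; slot++) {
            if (bucketSecond[slot] < 0 || second - bucketSecond[slot] >= BUCKETS) continue;
            for (Event e : Event.values()) {
                long n = counts[slot][e.ordinal()];
                total += n;
                if (e == Event.ERROR || e == Event.TIMEOUT || e == Event.REJECTED) failed += n;
            }
        }
        return total == 0 ? 0.0 : 100.0 * failed / total;
    }

    public static void main(String[] args) {
        long[] now = {0}; // fake clock in milliseconds
        RollingErrorRate stats = new RollingErrorRate(() -> now[0]);
        for (int i = 0; i < 3; i++) stats.record(Event.SUCCESS);
        stats.record(Event.ERROR);
        System.out.println(stats.errorPercentage()); // prints 25.0
        now[0] = 10_000;                             // the first second has left the window
        System.out.println(stats.errorPercentage()); // prints 0.0
    }
}
```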

Page 35: Fault Tolerance in a  High Volume, Distributed System

Netflix DependencyCommand Implementation

Page 41: Fault Tolerance in a  High Volume, Distributed System

Netflix DependencyCommand Implementation

Fallbacks

Cache
Eventual Consistency

Stubbed Data
Empty Response
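Three of these fallback styles (cache, stubbed data, empty response) can be illustrated with a small sketch. Everything here is hypothetical: the class, the recommendation use case, and the stub value are invented for the example, and the eventual-consistency style (deferring a write for later) is not shown.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

final class RecommendationFallbacks {
    private final Map<String, List<String>> lastKnownGood = new ConcurrentHashMap<>();

    /** Called after each successful live response so the cache fallback has something to serve. */
    void remember(String userId, List<String> titles) {
        lastKnownGood.put(userId, titles);
    }

    /** Cache: the last successful response, possibly stale. */
    Optional<List<String>> fromCache(String userId) {
        return Optional.ofNullable(lastKnownGood.get(userId));
    }

    /** Stubbed data: a generic, non-personalized default. */
    static List<String> stubbed() {
        return List.of("Popular titles");
    }

    /** Empty response: fail silent and let the caller render without this section. */
    static List<String> empty() {
        return List.of();
    }

    /** Prefers cached data, then a stub if one is acceptable, then an empty response. */
    List<String> fallback(String userId, boolean stubAllowed) {
        return fromCache(userId).orElseGet(() -> stubAllowed ? stubbed() : empty());
    }

    public static void main(String[] args) {
        RecommendationFallbacks fallbacks = new RecommendationFallbacks();
        System.out.println(fallbacks.fallback("user-1", true));  // prints [Popular titles]
        fallbacks.remember("user-1", List.of("A", "B"));
        System.out.println(fallbacks.fallback("user-1", true));  // prints [A, B]
    }
}
```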

Page 44: Fault Tolerance in a  High Volume, Distributed System

Rolling Number
Realtime Stats and Decision Making

Page 45: Fault Tolerance in a  High Volume, Distributed System

Request Collapsing
Take advantage of resiliency to improve efficiency
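The diagrams for this slide did not survive in the transcript. Assuming request collapsing means batching several single-key requests into one call to the dependency, a minimal sketch could look like this. RequestCollapser and its methods are hypothetical; flush() is called by hand here, where a real system would presumably trigger it from a short timer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class RequestCollapser<K, V> {
    private final Function<List<K>, Map<K, V>> batchCall; // one dependency call for many keys
    private Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

    RequestCollapser(Function<List<K>, Map<K, V>> batchCall) {
        this.batchCall = batchCall;
    }

    /** Queues a single-key request; duplicate keys in the same window share one future. */
    synchronized CompletableFuture<V> submit(K key) {
        return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
    }

    /** Executes everything queued so far as one batch and completes the waiting futures. */
    void flush() {
        Map<K, CompletableFuture<V>> batch;
        synchronized (this) {
            batch = pending;
            pending = new LinkedHashMap<>();
        }
        if (batch.isEmpty()) return;
        try {
            Map<K, V> results = batchCall.apply(new ArrayList<>(batch.keySet()));
            batch.forEach((key, future) -> future.complete(results.get(key)));
        } catch (RuntimeException e) {
            batch.values().forEach(future -> future.completeExceptionally(e));
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        RequestCollapser<Integer, String> collapser = new RequestCollapser<>(ids -> {
            calls[0]++; // count how many times the dependency is actually called
            Map<Integer, String> out = new LinkedHashMap<>();
            for (Integer id : ids) out.put(id, "movie-" + id);
            return out;
        });
        CompletableFuture<String> a = collapser.submit(1);
        CompletableFuture<String> b = collapser.submit(2);
        collapser.flush();
        System.out.println(a.join() + ", " + b.join() + ", batch calls: " + calls[0]);
        // prints "movie-1, movie-2, batch calls: 1"
    }
}
```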

Page 48: Fault Tolerance in a  High Volume, Distributed System

Fail fast.
Fail silent.
Fallback.

Shed load.

Page 49: Fault Tolerance in a  High Volume, Distributed System

Questions & More Information

Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html

Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html

Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen