![Page 1: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/1.jpg)
On Failure and Resilience
Mike Brittain
Director of Engineering, Etsy
@mikebrittain
Presented at 37signals on Aug 21, 2012
![Page 2: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/2.jpg)
“Software Infrastructure”
“Framework” code, caching, ORM, file storage tier, developer tools, CI/deployment, site performance, front-end architecture.
![Page 3: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/3.jpg)
Managing failures and building resilience into systems, applications,
process, and people.
![Page 4: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/4.jpg)
![Page 5: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/5.jpg)
Photo: http://www.etsy.com/shop/TheOldTimeJunkShop
$61 M in goods sold in the marketplace
2.9 M items sold
1.2 B page views
http://www.etsy.com/blog/news/2012/etsy-statistics-june-weather-report/
![Page 6: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/6.jpg)
Architecture
Linux, Apache, MySQL, PHP, Postgres, Solr, Gearman, Memcache, Chef, Hadoop, EC2/S3/EMR
30+ Logical data stores (23 shards + more functionally partitioned)
Search and storage tiers as “services”
![Page 7: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/7.jpg)
150 Engineers + Designers + Product
(this was 20 in Feb 2010)
credit: martin_heigan (flickr)
![Page 8: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/8.jpg)
Buyers, sellers, support, developer API, i18n, core infrastructure, storage, payments, security, fraud detection, big data and BI, email delivery, corp IT, operations, developer tools, continuous integration and testing, site performance, search, advertising, seller economics, mobile web, iOS.
![Page 9: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/9.jpg)
![Page 10: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/10.jpg)
Zero Release Managers
![Page 11: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/11.jpg)
There Will Be Fail
Credit: wilkee.deviantart.com
![Page 12: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/12.jpg)
We cannot comprehend all of the ways in which the individual parts of a complex system will interact. We cannot know all of the states and scenarios.
We cannot prevent failures.
![Page 13: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/13.jpg)
Yet, we can mitigate them.
Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
![Page 14: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/14.jpg)
“Uptime” is not binary.
![Page 15: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/15.jpg)
Convos AsyncTasks Ads Auth
Functionally Partitioned
![Page 16: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/16.jpg)
![Page 17: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/17.jpg)
Master-Master Replication
[Diagram: paired servers for Ads, Auth, Asynctasks, and Convos]
![Page 18: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/18.jpg)
![Page 19: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/19.jpg)
![Page 20: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/20.jpg)
Sharded Tables
[Diagram: paired servers for shard1, shard2, shard3, and shard4]
~!" of listing data is stored on shard#
![Page 21: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/21.jpg)
![Page 22: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/22.jpg)
Sharded Tables
Outage is limited to~!" of data set
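The point of the sharding slides can be sketched in a few lines of PHP. This is only an illustration: the modulo placement rule and the `up` health flags are assumptions made for the example, since the slides do not say how listings are mapped to shards.

```php
<?php
// Hypothetical shard map. The names follow the slide; the 'up' flags
// are invented so that one shard can be simulated as failed.
$shards = array(
    'shard1' => array('up' => true),
    'shard2' => array('up' => false), // simulated outage
    'shard3' => array('up' => true),
    'shard4' => array('up' => true),
);

// Assumed placement rule: listing id modulo the number of shards.
function shard_for($listing_id, array $shards) {
    $names = array_keys($shards);
    return $names[$listing_id % count($names)];
}

// A listing is readable only if the shard that holds it is up.
function listing_available($listing_id, array $shards) {
    return $shards[shard_for($listing_id, $shards)]['up'];
}

// Count how many of the given listings are unreachable right now.
function count_unavailable(array $listing_ids, array $shards) {
    $down = 0;
    foreach ($listing_ids as $id) {
        if (!listing_available($id, $shards)) {
            $down++;
        }
    }
    return $down;
}
```

With one of four shards down, only the listings that live on that shard are affected; everything else keeps serving.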
![Page 23: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/23.jpg)
“Uptime” is not binary.
![Page 24: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/24.jpg)
Uptime of the application is the responsibility of our Operations team.
![Page 25: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/25.jpg)
Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.
![Page 26: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/26.jpg)
Uptime of the application is the responsibility of our Operations, Engineering, Product, and Design teams.
If you are committing code, you are operating the site.
![Page 27: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/27.jpg)
Branching in Code
![Page 28: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/28.jpg)
“All existing revision control systems were built by people who build installed software”
Always Ship Trunk
Paul Hammond
Velocity Conf 2010
![Page 29: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/29.jpg)
Config Flags
Enable and disable features quickly.
Features for staff or for beta groups.
Percentage ramp-up of users or requests.
A/B “experiments.”
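A minimal sketch of how one flag check might cover these cases. The `staff` and `rampup` values, the `percent` key, and the `feature_enabled()` helper are all hypothetical names chosen for illustration; they are not taken from the slides or from Etsy's code.

```php
<?php
// Hypothetical flag table. Only 'enabled' => 'on' appears on the slides;
// the other values and the 'percent' key are assumptions.
$cfg = array(
    'checkout'   => array('enabled' => 'on'),
    'sign_in'    => array('enabled' => 'off'),
    'beta_tools' => array('enabled' => 'staff'),
    'new_search' => array('enabled' => 'rampup', 'percent' => 25),
);

/**
 * Decide whether a feature is on for a given user.
 */
function feature_enabled(array $cfg, $name, $user_id, $is_staff = false) {
    if (!isset($cfg[$name]['enabled'])) {
        return false; // unknown flags default to off
    }
    switch ($cfg[$name]['enabled']) {
        case 'on':
            return true;
        case 'staff':
            return $is_staff;
        case 'rampup':
            $percent = isset($cfg[$name]['percent']) ? (int) $cfg[$name]['percent'] : 0;
            // Hash flag name + user id into a stable bucket from 0 to 99,
            // so a user gets the same answer on every request.
            $bucket = abs(crc32($name . ':' . $user_id) % 100);
            return $bucket < $percent;
        default:
            return false; // 'off' and anything unrecognized
    }
}
```

Hashing the user id rather than calling a random function keeps each user's experience consistent while the percentage is raised, and including the flag name in the hash means different features ramp up across different slices of users.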
![Page 30: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/30.jpg)
$cfg['new_search'] = array('enabled' => 'on');
$cfg['sign_in'] = array('enabled' => 'on');
$cfg['checkout'] = array('enabled' => 'on');
$cfg['homepage'] = array('enabled' => 'on');
![Page 31: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/31.jpg)
$cfg['new_search'] = array('enabled' => 'on');

// Meanwhile...

// Test the flag's value; the array itself is always truthy.
if ($cfg['new_search']['enabled'] === 'on') {
    # New hotness
    $results = do_solr();
} else {
    # old and boring
    $results = do_grep();
}
![Page 32: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/32.jpg)
But...
![Page 33: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/33.jpg)
“Doesn’t that mean you have conditionals all over your code?”
Yes.
![Page 34: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/34.jpg)
“Doesn’t that mean you have conditionals all over your code?”
Yes.
“Does anyone ever clean those up?”
Sometimes.
![Page 35: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/35.jpg)
“Doesn’t that mean you have conditionals all over your code?”
Yes.
“Does anyone ever clean those up?”
Sometimes.
“That sounds like it sucks.”
Really?
![Page 36: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/36.jpg)
“Doesn’t that mean you have conditionals all over your code?”
Yes.
“Does anyone ever clean those up?”
Sometimes.
“That sounds like it sucks.”
Really?
“Wait a minute... all of the counter arguments are in Comic Sans. WTF?!?”
Oh, you noticed? ;)
![Page 37: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/37.jpg)
00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored
DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
![Page 38: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/38.jpg)
“Uptime” is not binary.
![Page 39: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/39.jpg)
Features are launched by flipping a config flag, not by deploying hundreds of lines of code.
![Page 40: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/40.jpg)
“If Engineering at Etsy has a religion, it’s the Church of Graphs.”
Ian Malpass, Code as Craft
http://etsy.me/ePkoZB
![Page 41: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/41.jpg)
![Page 42: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/42.jpg)
![Page 43: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/43.jpg)
http://www.flickr.com/photos/flyforfun/2694158656/
THIS IS HOW YOU RUN A COMPLEX SYSTEM
![Page 44: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/44.jpg)
http://www.flickr.com/photos/flyforfun/2694158656/
Operator
Config flags
Metrics
![Page 45: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/45.jpg)
Oh, you want to talk about how we collect metrics and make graphs?
http://www.slideshare.net/mikebrittain/metricsdriven-engineering
![Page 46: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/46.jpg)
Resilient User Interfaces
![Page 47: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/47.jpg)
Interfaces and user experiences that adapt to technical and architectural failure.
![Page 48: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/48.jpg)
![Page 49: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/49.jpg)
![Page 50: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/50.jpg)
http://www.flickr.com/photos/caffeina/2144044776/
![Page 51: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/51.jpg)
http://www.flickr.com/photos/17793901@N00/106331831/
![Page 52: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/52.jpg)
![Page 53: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/53.jpg)
![Page 54: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/54.jpg)
/**
 * Creates a database connection.
 */
public function __construct($host, $user, $pass, $db) {
    parent::__construct($host, $user, $pass, $db);

    if (mysqli_connect_error()) {
        throw new DBConnection_Exception(
            sprintf("Error: %s, %s", mysqli_connect_errno(), mysqli_connect_error()));
    }
}
![Page 55: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/55.jpg)
try {
    $conn = new DBConnection('viewsdb.host', 'db_read_user', 'ssssshh!', 'views_db');
} catch (DBConnection_Exception $e) {
    // TODO: Someone should figure out what to do if
    // we can't connect to the views db.
    throw $e;
}
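One way to resolve that TODO, sketched under assumptions: if view counts are not essential to rendering the page, the caller can log the failure and render the page without them instead of rethrowing. The `get_view_count()` and `render_view_count()` helpers, the `$connect` callable, and the `count_views()` method are hypothetical names invented for this example.

```php
<?php
// Stand-in for the exception class used on the slide.
class DBConnection_Exception extends Exception {}

/**
 * Fetch a listing's view count, or null if the views db is unreachable.
 * $connect is a hypothetical callable that returns a connection object
 * exposing count_views(), or throws DBConnection_Exception.
 */
function get_view_count($connect, $listing_id) {
    try {
        $conn = call_user_func($connect);
        return $conn->count_views($listing_id);
    } catch (DBConnection_Exception $e) {
        // Record the failure for operators, but do not take the page down.
        error_log('views db unavailable: ' . $e->getMessage());
        return null;
    }
}

/**
 * Render the view-count fragment; an unknown count renders as nothing.
 */
function render_view_count($count) {
    return $count === null ? '' : sprintf('%d views', $count);
}
```

Returning `null` rather than `0` lets the interface distinguish "no views yet" from "we could not find out", so the page can omit the number instead of showing a wrong one.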
![Page 56: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/56.jpg)
![Page 57: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/57.jpg)
![Page 58: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/58.jpg)
Site navigation
Logo
Cute Picture
Generic, catch-all error messaging....
![Page 59: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/59.jpg)
http://www.flickr.com/photos/caffeina/2144044776/
![Page 60: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/60.jpg)
Every back-end service is an opportunity for failure.
![Page 61: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/61.jpg)
![Page 62: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/62.jpg)
![Page 63: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/63.jpg)
![Page 64: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/64.jpg)
![Page 65: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/65.jpg)
![Page 66: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/66.jpg)
Critical Path
![Page 67: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/67.jpg)
![Page 68: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/68.jpg)
![Page 69: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/69.jpg)
![Page 70: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/70.jpg)
http://www.flickr.com/photos/caffeina/2144044776/
#srsly?
![Page 71: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/71.jpg)
" #$$ ms
![Page 72: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/72.jpg)
Non-blocking Ajax
![Page 73: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/73.jpg)
Google Docs
Google Calendar
![Page 74: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/74.jpg)
GMail
![Page 75: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/75.jpg)
“Oops, we aren’t able to access click metrics right
now, do not worry — your data is safe.”
![Page 76: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/76.jpg)
Product design doesn’t stop at 100% availability.
![Page 77: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/77.jpg)
Ops Dev
![Page 78: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/78.jpg)
Product
Ops Dev
![Page 79: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/79.jpg)
![Page 80: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/80.jpg)
Operability Reviews
![Page 81: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/81.jpg)
What is changing about the architecture?
What kind of data access patterns are we using?
How much traffic, how many queries?
What metrics are we collecting?
Are there automated alerts? How do we know the thresholds are right?
How do we turn it off? ...and what happens when we do?
“What could possibly go wrong?”
![Page 82: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/82.jpg)
![Page 83: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/83.jpg)
“GameDay” Exercises
![Page 84: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/84.jpg)
Tuesday, April 24, 12
![Page 85: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/85.jpg)
Tuesday, April 24, 12
Pedro
![Page 86: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/86.jpg)
Surprise!!!
Turning off multi-language support improves our page generation times by up to 25%.
Homepage (95th perc.)
![Page 87: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/87.jpg)
(Blameless) Post-Mortems
![Page 88: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/88.jpg)
How could this have gone better?
How quickly did we find out that something was wrong?
Did we communicate well to our visitors and each other?
Why did we have confidence that what we were doing was OK?
Did we have the right tools, did we use them properly?
Did we collect metrics, and could we find them?
Where did we make the wrong decisions?
What steps do we take to reduce the chance of this happening again in the future?
![Page 89: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/89.jpg)
“... an engineer who thinks they’re going to be reprimanded are disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure.
This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, another one in the future.”
http://codeascraft.etsy.com/2012/05/22/blameless-postmortems/
John Allspaw
VP, Technical Operations, Etsy
![Page 90: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/90.jpg)
We should try to learn not only what went wrong, but also what went right.
![Page 91: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/91.jpg)
00:00 Site down for maintenance
+01:47 Site up, disabled login and registration
+06:40 Site up, some seller tools disabled
+07:41 All features restored
DB Server Maintenance, June 16, 2012
http://etsystatus.com/2012/06/16/planned-outage-june-16th-7am-gmt/
![Page 92: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/92.jpg)
Operational Mindset
Ops Dev Product
![Page 93: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/93.jpg)
Business Priorities
Operational Mindset
Ops Dev Product
![Page 94: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/94.jpg)
Introspection
![Page 95: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/95.jpg)
Page views for error template
![Page 96: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/96.jpg)
Page views for error template
...or, how are we screwing our users?
![Page 97: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/97.jpg)
Risk mitigation in a complex system
Redundant system architectures.
Small, well-understood changes to production.
Control application using config flags.
Gratuitous metrics collection.
Resilient user interfaces.
GameDay exercises.
![Page 99: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/99.jpg)
![Page 100: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/100.jpg)
![Page 101: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/101.jpg)
![Page 102: On Failure and Resilience](https://reader033.vdocuments.us/reader033/viewer/2022052822/554bb095b4c905b3618b5906/html5/thumbnails/102.jpg)
PHOTO CREDITS
Flickr: roboppy http://www.flickr.com/photos/51035735481@N01/163374138/
Flickr: jamesjyu http://www.flickr.com/photos/32593095@N00/3465022/
Flickr: circulating http://www.flickr.com/photos/26835318@N00/2318226026/