SRECon EU, July 2016
Greg Burek
Lessons from Automatic Incident Resolution for a Million Databases
The Twelve-Factor App
Department of Data
PostgreSQL, Redis, Kafka
~1 Million Databases
Tens of thousands of AWS Instances
Some Databases
Hundreds of AWS Instances
“The goal is to build systems that can scale linearly with machines & sub-linearly with people” - Caitie McCaffrey
Tackling Alert Fatigue
Monitor and alert on your business
Monitor and alert on your business
Usually, don’t alert on machine-specific metrics
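A minimal sketch of alerting on a customer-facing signal instead of machine metrics. It assumes a hypothetical probe that connects to each database and runs a trivial query; the `ProbeResult` type, the `should_page` function, and the 1% threshold are illustrative, not details from the talk.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of a hypothetical customer-facing probe (connect, run a trivial query)."""
    database_id: str
    ok: bool


def should_page(results: list[ProbeResult], max_failure_ratio: float = 0.01) -> bool:
    """Page only when the customer-visible failure rate crosses a threshold.

    Machine metrics such as CPU or disk usage are deliberately not consulted;
    they can still be graphed for diagnosis without waking anyone up.
    """
    if not results:
        # Receiving no probe data at all is itself suspicious, so treat it as page-worthy.
        return True
    failures = sum(1 for r in results if not r.ok)
    return failures / len(results) > max_failure_ratio
```

With the assumed 1% threshold, one failing database out of a hundred does not page, but two do.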
Write runbooks and playbooks
Turn playbooks into code
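One way a written playbook might be turned into code, sketched under the assumption that each step can be expressed as a check plus an action. The step names, the dictionary standing in for a database, and `run_playbook` are all hypothetical.

```python
from typing import Callable

# A playbook step: a human-readable description, a predicate deciding whether
# the step applies, and the remediation action. (hypothetical structure)
Step = tuple[str, Callable[[dict], bool], Callable[[dict], None]]


def run_playbook(steps: list[Step], db: dict) -> list[str]:
    """Run each applicable step in order and return the descriptions of steps taken."""
    log: list[str] = []
    for description, applies, action in steps:
        if applies(db):
            action(db)
            log.append(description)
    return log


# Hypothetical checks and actions; a dict stands in for a real database handle.
def disk_is_full(db: dict) -> bool:
    return db["disk_full"]


def clear_disk(db: dict) -> None:
    db["disk_full"] = False


def is_stopped(db: dict) -> bool:
    return not db["running"]


def restart(db: dict) -> None:
    db["running"] = True


# Example playbook for a hypothetical "database unreachable" alert.
UNREACHABLE_PLAYBOOK: list[Step] = [
    ("clear disk if full", disk_is_full, clear_disk),
    ("restart if stopped", is_stopped, restart),
]
```

Keeping the human-readable description next to each step means the log of what automation did reads like the original written playbook.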
“The goal is not to never get paged, the goal is to never get paged for the same thing twice” - Astrid Atkinson
Engineering for the long game
Verify monitoring before restarting the world
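A sketch of one possible check before automation acts, assuming two hypothetical safeguards: refuse to act when an implausibly large share of the fleet reports failure at once (more likely a monitoring fault than a mass outage), and re-confirm each reported failure with an independent probe. The function name, the `confirm` callback, and the 5% threshold are illustrative.

```python
from typing import Callable


def plan_remediation(
    reported: set[str],
    fleet_size: int,
    confirm: Callable[[str], bool],
    max_fraction: float = 0.05,
) -> tuple[str, set[str]]:
    """Decide whether to trust a batch of failure reports before acting on them.

    Returns ("page", empty set) when the reports look like a monitoring problem,
    otherwise ("remediate", the subset of databases confirmed as failing).
    """
    if fleet_size <= 0 or len(reported) / fleet_size > max_fraction:
        # Too many failures at once: suspect the monitor, not the fleet.
        return "page", set()
    # Only act on failures that an independent check also sees.
    confirmed = {db for db in reported if confirm(db)}
    return "remediate", confirmed
```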
Circuit breakers
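A minimal circuit breaker for automated remediation, assuming a simple policy that is not taken from the talk: allow at most `max_actions` automated actions per sliding time window, and once that limit is hit, stay open until a human resets it. The injectable `clock` parameter exists only to make the sketch testable.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    """Stop automated remediation after too many actions in a time window.

    Once tripped, the breaker stays open until a human calls reset().
    """

    def __init__(self, max_actions: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps: deque[float] = deque()
        self.tripped = False

    def allow(self) -> bool:
        """Return True and record the action if automation may proceed."""
        if self.tripped:
            return False
        now = self.clock()
        # Forget actions that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_actions:
            # Too much automated activity: trip and leave it to a human.
            self.tripped = True
            return False
        self.timestamps.append(now)
        return True

    def reset(self) -> None:
        """Manual reset, intended to be called after a human has investigated."""
        self.tripped = False
        self.timestamps.clear()
```

Requiring a manual reset, rather than closing again after a cool-down, is a deliberate choice in this sketch: it guarantees a person looks at whatever made automation misbehave.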
Automation can’t handle the unknown
Wake someone up on exceptions and timeouts
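A sketch of wrapping an automated remediation so that any exception or overrun pages a human. The `page` callback stands in for whatever paging integration is in use and is hypothetical. Python threads cannot be forcibly stopped, so a timed-out action is abandoned rather than killed.

```python
import concurrent.futures
from typing import Callable


def run_or_page(action: Callable[[], None], page: Callable[[str], None],
                timeout_seconds: float) -> bool:
    """Run an automated remediation; wake a human if it raises or runs too long.

    Returns True only when the action completed successfully in time.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(action)
    try:
        future.result(timeout=timeout_seconds)
        return True
    except concurrent.futures.TimeoutError:
        page(f"remediation timed out after {timeout_seconds}s")
        return False
    except Exception as exc:  # an unknown failure that automation can't handle
        page(f"remediation raised {exc!r}")
        return False
    finally:
        # Don't block on a hung action; its thread is left to finish on its own.
        executor.shutdown(wait=False)
```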
Have a REPL/console
Aggregate and review trends
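A sketch of aggregating incident history for periodic review, assuming a hypothetical `Incident` record with a type and a flag for whether automation resolved it. Frequent kinds with a low automation rate are natural candidates for the next playbook or for fixing the underlying cause.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """A hypothetical record of one resolved incident."""
    kind: str        # e.g. "disk_full", "unreachable"
    automated: bool  # True if automation resolved it without a human


def summarize_trends(incidents: list[Incident],
                     top_n: int = 3) -> list[tuple[str, int, float]]:
    """Return (kind, count, automation rate) for the most frequent incident kinds."""
    totals = Counter(i.kind for i in incidents)
    automated = Counter(i.kind for i in incidents if i.automated)
    return [(kind, count, automated[kind] / count)
            for kind, count in totals.most_common(top_n)]
```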
Humans can break
Automation can be simplistic
Humans + Automation for a resilient and operable system
1. Monitor and alert on your business
2. Write playbooks
3. Make playbooks into automation
4. Checks and balances of automation
5. Circuit breakers
6. Alert on exceptions and timeouts
7. Admin console
8. Aggregate and review trends
@gregburek
State Machines
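The final slide only names state machines. As an illustration of the general idea (the states and transitions below are assumptions, not the talk's design), a database's incident lifecycle can be modeled as explicit states with an allow-list of transitions, so automation cannot move a database into an unexpected state.

```python
from enum import Enum


class State(Enum):
    HEALTHY = "healthy"
    SUSPECTED = "suspected"      # monitoring reported a problem
    REMEDIATING = "remediating"  # automation is acting
    ESCALATED = "escalated"      # a human has been paged


# Hypothetical allow-list of transitions; anything not listed is rejected.
TRANSITIONS: dict[State, set[State]] = {
    State.HEALTHY: {State.SUSPECTED},
    State.SUSPECTED: {State.HEALTHY, State.REMEDIATING, State.ESCALATED},
    State.REMEDIATING: {State.HEALTHY, State.ESCALATED},
    State.ESCALATED: {State.HEALTHY},
}


class DatabaseLifecycle:
    """Tracks one database's state and records every transition taken."""

    def __init__(self) -> None:
        self.state = State.HEALTHY
        self.history: list[State] = [self.state]

    def transition(self, new_state: State) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}")
        self.state = new_state
        self.history.append(new_state)
```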