Using SaltStack to Auto Triage and Remediate Production Systems
![Page 1: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/1.jpg)
Michael Kehoe, Senior Site Reliability Engineer
Using SaltStack to Auto Triage and Remediate Production Systems
![Page 2: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/2.jpg)
Topics
• $ whoami
• Salt @ LinkedIn
• LinkedIn’s auto remediation story
• Nurse
• Salt API
• Auto-remediation Salt Modules
• How to get started
![Page 3: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/3.jpg)
$ whoami
![Page 4: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/4.jpg)
Salt @ LinkedIn
• 11-14k minions per master
• Up to 8.5 events/second per master
• Deployment system heavily utilizes Salt
• 5 dedicated SREs work on LinkedIn’s Salt infrastructure
  • A number of other SREs also contribute
![Page 5: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/5.jpg)
Alert Growth vs NOC Engineers
LinkedIn’s Auto-remediation Story
[Chart: number of alerts (axis 0 to 40,000) and number of engineers (axis 0 to 30), by year, 2010 to 2017]
![Page 6: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/6.jpg)
The problem
LinkedIn’s Auto-remediation Story
• Growing number of high-priority alerts vs engineers to watch them
• Complicated ITRs (runbooks) took time for NOC engineers to execute and escalate
• SREs would forget to run diagnostic tooling during outages
• Longer MTTR == bad experience for members
![Page 7: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/7.jpg)
Building the Solution
LinkedIn’s Auto-Remediation Story
• LinkedIn needed a workflow engine
• Requirements
  • Take automated actions against applications/hosts
  • Perform the runbook automatically
  • Appropriate auditing
  • Scalable
![Page 8: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/8.jpg)
What’s already out there
Autoremediation
• StackStorm
• fbar
• SaltStack
Started building an in-house solution in mid-2014
![Page 9: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/9.jpg)
Nurse
Image:freeflaticons.com
![Page 10: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/10.jpg)
Nurse
• ‘Event based’ system
  • Built on top of LinkedIn’s existing monitoring infrastructure
  • Allows for manual execution via Web UI
  • Allows for other systems to plug in via REST API
• Built-in rules engine
  • Allows for complicated checks/branching
• Global environment awareness
• Scalable!
![Page 11: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/11.jpg)
Nurse
![Page 12: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/12.jpg)
Nurse
![Page 13: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/13.jpg)
The basics
Salt REST API
• RESTful interface to Salt
• Allows for external authentication
  • PAM
  • LDAP
  • Your own plugin…
• Built on top of CherryPy
![Page 14: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/14.jpg)
Interface
Salt REST API
• /login - Log in to receive a session token
• /logout - Remove or invalidate sessions
• /minions - Working directly with minions
• /jobs - Getting lists of previously run jobs or getting the return from a single job
• /run - Run commands without normal session handling
• /events - Expose the Salt event bus
• /hook - A generic web hook entry point that fires an event on Salt's event bus
• /keys - Wrapper around key management
• /ws - Open a WebSocket connection to Salt's event bus
• /stats - Return a dump of statistics collected from the CherryPy server
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
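As a rough illustration of how a client might talk to the /login endpoint and then run a command through the root URL, here is a minimal sketch using only the Python standard library. The host name and port are taken from the later pepper example, and the function names are hypothetical. The requests are only built, not sent, so nothing here depends on a live master.

```python
import json
import urllib.request

BASE_URL = "https://salt.example.com:8888"  # assumed host and port


def build_login_request(username, password, eauth="ldap"):
    """Build (but do not send) a POST /login request."""
    body = json.dumps({"username": username, "password": password, "eauth": eauth})
    return urllib.request.Request(
        BASE_URL + "/login",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def build_command_request(token, tgt, fun):
    """Build a POST / request that runs `fun` on minions matching `tgt`."""
    lowstate = [{"client": "local", "tgt": tgt, "fun": fun}]
    return urllib.request.Request(
        BASE_URL + "/",
        data=json.dumps(lowstate).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # The session token returned by /login is sent in this header.
            "X-Auth-Token": token,
        },
        method="POST",
    )
```

To actually send a request, pass it to urllib.request.urlopen(); the login response carries the token under return[0]["token"].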
![Page 15: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/15.jpg)
Configuration
Salt REST API
/etc/salt/master.d/salt-api.conf
external_auth:
  ldap:
    headless-nurse:
      - '*':
        - test.*
        - nurse.*
      - '@jobs'
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
![Page 16: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/16.jpg)
Configuration
Salt REST API
/etc/salt/master.d/salt-api.conf
rest_cherrypy:
  port: 8888
  ssl_crt: /etc/salt/pki/api/salt-api.crt
  ssl_key: /etc/salt/pki/api/salt-api.key
https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html
![Page 17: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/17.jpg)
Configuration
Salt REST API
/etc/salt/master.d/salt-api.conf
auth.ldap.basedn: 'dc=example,dc=com'
auth.ldap.binddn: 'readonly-ldap@example'
auth.ldap.bindpw: 'password'
auth.ldap.filter: sAMAccountName={{ username }}
auth.ldap.server: ldap.example.com
auth.ldap.tls: True
auth.ldap.persontype: 'person'
auth.ldap.groupclass: 'group'
auth.ldap.groupou: 'Users'
https://docs.saltstack.com/en/latest/topics/eauth/index.html
![Page 18: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/18.jpg)
Code Interfaces
Salt REST API
• There is a Python interface to the Salt API: pepper
  • Provides a basic wrapper around the REST API
  • https://github.com/saltstack/pepper/
• See https://github.com/SUSE/salt-netapi-client for Java bindings
![Page 19: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/19.jpg)
Python Code example
Salt REST API
from pepper.libpepper import Pepper

# Connect to the salt-api endpoint and authenticate via the LDAP eauth backend
api = Pepper('https://salt.example.com:8888')
api.login('saltdev', 'saltdev', 'ldap')

# Run simple function
api.low([{'client': 'local', 'tgt': '*', 'fun': 'test.ping'}])

# Execute a runner function
api.runner('jobs.lookup_jid', jid=12345)
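A caller usually wants to act on what test.ping returns. The sketch below assumes the response has the shape rest_cherrypy normally produces for a local-client call (a "return" list whose first element maps minion IDs to results). The helper name and the sample data are hypothetical.

```python
def unresponsive_minions(response, expected):
    """Return the expected minions that did not answer test.ping with True."""
    returns = response.get("return") or [{}]
    answered = {minion for minion, result in returns[0].items() if result is True}
    return sorted(set(expected) - answered)


# Illustrative data only; a real response would come from api.low(...)
sample = {"return": [{"web01": True, "web02": True, "db01": False}]}
print(unresponsive_minions(sample, ["web01", "web02", "db01", "db02"]))
# prints ['db01', 'db02']
```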
![Page 20: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/20.jpg)
Goals
Auto-remediation Salt Modules
• Auto-triage issues
  • Reduce context-switching
  • Run labor-intensive tasks automatically
  • Gather data while engineer is logging in after escalation
• Auto-remediate issues
  • Let the engineer sleep
  • Faster MTTR
![Page 21: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/21.jpg)
Implementation
Auto-remediation Salt Modules
• Auto-triage issues
  • Identify abusive clients
  • Collect & analyze thread/heap dumps
  • Parse & summarize log files
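One way "identify abusive clients" could look in practice is a simple count of requests per client. This is only a sketch, not the logic Nurse uses: it assumes each access-log line starts with a client identifier such as an IP address, and the function name and threshold are hypothetical.

```python
from collections import Counter


def top_clients(log_lines, threshold):
    """Return (client, request_count) pairs at or above `threshold`, busiest first."""
    # Assumption: the first whitespace-separated field identifies the client.
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return [(client, n) for client, n in counts.most_common() if n >= threshold]
```

The result can be attached to the alert so the engineer sees the likely offenders as soon as they log in.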
![Page 22: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/22.jpg)
Implementation
Auto-remediation Salt Modules
• Auto-remediate issues
  • Restart/OOR applications
  • Scale up applications
  • Block abusive clients
  • Update A/B experiment definitions
  • Otherwise escalate…
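The "try a fix, otherwise escalate" flow can be sketched as a small ladder of actions. This is an illustration under assumptions, not Nurse's rules engine: each action is a hypothetical callable that returns True when the problem is resolved, and `escalate` is whatever pages a human.

```python
def remediate(alert, actions, escalate):
    """Try each remediation in order; escalate if none reports success.

    `actions` is a list of (name, callable) pairs; each callable takes the
    alert and returns True when the problem is resolved.
    """
    attempted = []
    for name, action in actions:
        attempted.append(name)
        try:
            if action(alert):
                return {"resolved": True, "by": name, "attempted": attempted}
        except Exception as exc:  # a failing action must not stop the ladder
            attempted[-1] = f"{name} (failed: {exc})"
    escalate(alert, attempted)
    return {"resolved": False, "by": None, "attempted": attempted}
```

Returning the list of attempted actions gives the on-call engineer the context of what automation already tried.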
![Page 23: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/23.jpg)
Implementation
Auto-remediation Salt Modules
• Plenty of Salt modules/runners already out there
  • Over 300 modules are available in Salt core
  • Modules to remediate & notify
• Write your own
  • Make sure you test!
![Page 24: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/24.jpg)
Success
LinkedIn’s Auto-remediation Story
• 854k actions taken
• 100% of service health check alerts are onboarded
• ~37k man hours have now been automated
• Now automating ~1100 hours/week
![Page 25: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/25.jpg)
Success
LinkedIn’s Auto-remediation Story
[Chart: number of alerts (axis 0 to 40,000) and number of engineers (axis 0 to 30), by year, 2010 to 2017]
![Page 26: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/26.jpg)
Some lessons learnt
How to get started
• Need to decide how you architect YOUR ‘event bus’
  • Use external monitoring to trigger Salt via API
  • Use reactors/beacons internally within Salt’s event bus
  • See ‘Thorium’ documentation for Salt’s new reactor engine
• Auditing is important!
  • Need to know what/who triggered actions
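To make the auditing point concrete, here is one possible shape for an audit trail: one JSON line per automated action, recording what triggered it and who requested it. The field names and the file-based storage are assumptions for this sketch; a real deployment might write to a database or use a Salt returner instead.

```python
import json
import time
import uuid


def audit_record(trigger, actor, target, function, now=None):
    """Build one audit entry describing an automated action."""
    return {
        "id": uuid.uuid4().hex,
        "timestamp": time.time() if now is None else now,
        "trigger": trigger,    # what fired, e.g. an alert name
        "actor": actor,        # who or which service account requested it
        "target": target,      # minion or application acted on
        "function": function,  # Salt function that was run
    }


def append_audit(path, record):
    """Append the record to a JSON-lines audit file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```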
![Page 27: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/27.jpg)
Some lessons learnt
How to get started
• Reporting is important!
  • Need to know how often automated actions are being taken
  • Find failure hotspots
  • Leverage event-bus or returners
• Safety first
  • Don’t give yourself too much rope…
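One example of limiting the rope is to cap how much of a pool automation may take out of service at once. The sketch below is an assumption about how such a guard could work, not a description of LinkedIn's safeguards; the function name and the 10% default are hypothetical.

```python
def select_safe_batch(candidates, pool_size, already_out, max_percent=10):
    """Return the candidates that may be restarted right now.

    At most `max_percent` of the pool may be out of service at once,
    counting hosts that are already out.
    """
    if pool_size <= 0:
        return []
    allowed_out = pool_size * max_percent // 100  # integer math, rounds down
    budget = max(allowed_out - already_out, 0)
    return list(candidates)[:budget]
```

With these defaults a pool of fewer than ten hosts gets a budget of zero, so the caller would escalate to a human rather than restart anything automatically.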
![Page 28: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/28.jpg)
Conclusion
• Think carefully about your ‘Event-Driven-Automation’ architecture
• The sky is the limit with Salt… don’t limit yourself
• Again… safety first
![Page 29: Using SaltStack to Auto Triage and Remediate Production Systems](https://reader036.vdocuments.us/reader036/viewer/2022082217/5875f1a01a28ab006e8b5093/html5/thumbnails/29.jpg)
Questions?
Thank You