ALT SRE
SRE Site/System Reliability Engineering
Typical SRE activities fall into some of the following categories 1
– Software engineering
- E.g. automation scripts, ... adding service features for scalability and reliability, or modifying infrastructure to make it more robust.
– Systems engineering
- E.g. Configuring production systems.
– Toil
- The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.
1 https://landing.google.com/sre/book/chapters/eliminating-toil.html
Research IT, Trinity College Dublin, [email protected]
What does ”ALT” mean?
– Short for alternative
– Alternative Site Reliability Engineering (ALT SRE) is like the SRE Google does, but not as good
– It’s much worse actually!
What’s a Science HPC Cluster?
Clusters have lots of servers that can sometimes fail.
(Often because the end user does cunning things to break them).
Clusters in Trinity 2
2 https://www.tchpc.tcd.ie/resources/clusters
That seems like a lot of servers that could break from time to time
What my mornings used to look like
1. Check for what’s broken
2. Perform an action from a small list of common actions on the nodes, based on what the queue management software tells you. E.g.
2.1 Reboot the node to clear out-of-memory errors
2.2 Restart certain services
2.3 Etc.
Repeat until all the nodes are back healthy or I get distracted.
A lot of work that was
– Manual
– Repetitive
– Automatable
– Tactical
– Scaled linearly as the service grew
What does ALT SRE look like in TCD?
– An in-house developed automation tool (i.e. a script),
– That ties together the common tools (sometimes just as simple as ’service blah restart’ on a server, or remotely power cycling it),
– Run periodically to get the clusters to heal themselves of common problems and free up staff for other, higher order, tasks.
– https://github.com/smcgrat/linux-general/blob/master/self-heal.sh
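A minimal sketch of the idea in the bullets above, not taken from self-heal.sh: a periodic pass that maps each reported problem to one of a small set of common actions. The problem labels, action names, and node names are all hypothetical, and the sketch only prints what it would do rather than touching any real node.

```shell
#!/usr/bin/env bash
# Sketch only: problem labels, action names and node names are hypothetical.

dryrunmode="on"   # default to dry run so the sketch never acts on a real node

# Pick a remedial action for a problem label.
function choose_action {
    local problem=$1
    case "$problem" in
        out_of_memory)  echo "reboot" ;;
        service_failed) echo "restart_service" ;;
        not_responding) echo "power_cycle" ;;
        *)              echo "leave_for_human" ;;
    esac
}

# Apply (or, in dry run mode, just report) the action for one node.
function heal_node {
    local node=$1 problem=$2
    local action
    action=$(choose_action "$problem")
    if [ "$dryrunmode" = "on" ]; then
        echo "$node: would $action ($problem)"
    else
        # A real tool would call its site-specific commands here.
        echo "$node: $action ($problem)"
    fi
}

# One pass over "node problem" pairs read from stdin,
# which is what a periodically scheduled job would run.
function heal_pass {
    local node problem
    while read -r node problem; do
        if [ -n "$node" ]; then
            heal_node "$node" "$problem"
        fi
    done
}

# Made-up input standing in for whatever the monitoring reports.
printf '%s\n' "node001 out_of_memory" "node002 not_responding" "node003 disk_full" | heal_pass
```

Anything the table does not recognise falls through to a human, which keeps the automation limited to the small list of well-understood fixes.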
Examples of what is looked for
declare -a nodes_with_problems
declare -a nodes_down_with_HC_errors
declare -a nodes_that_need_cluster_tests_run
declare -a nodes_that_are_draining
declare -a nodes_not_responding
declare -a nodes_with_epilog_errors
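As a rough illustration of how arrays like these might get filled, the sketch below files each node into a category based on its state and reason text. The sample report and the matching rules are assumptions made for this example; on a Slurm cluster the input could plausibly come from something like `sinfo -h -N -o "%N %T %E"`, but that is a guess at the approach, not what the script is known to do.

```shell
#!/usr/bin/env bash
# Sketch only: the sample report and the matching rules are assumptions.

declare -a nodes_with_problems
declare -a nodes_not_responding
declare -a nodes_that_are_draining
declare -a nodes_with_epilog_errors

# File one node into the matching array based on its state and reason.
function classify_node {
    local node=$1 state=$2 reason=$3
    nodes_with_problems+=("$node")
    case "$state:$reason" in
        *"Not responding"*) nodes_not_responding+=("$node") ;;
        *[Ee]pilog*)        nodes_with_epilog_errors+=("$node") ;;
        drain*)             nodes_that_are_draining+=("$node") ;;
    esac
}

# Made-up "node state reason..." lines standing in for the scheduler's report.
sample_report='node001 down Not responding
node002 draining Kernel upgrade
node003 drained Epilog error'

# Read in the current shell (no pipe) so the arrays keep their contents.
while read -r node state reason; do
    classify_node "$node" "$state" "$reason"
done <<< "$sample_report"

echo "not responding: ${nodes_not_responding[*]}"
echo "draining:       ${nodes_that_are_draining[*]}"
echo "epilog errors:  ${nodes_with_epilog_errors[*]}"
```

Each array can then be handed to the action that suits that class of problem.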
Example of what it might do
function restart_node {
    local node=$1
    if [ "$dryrunmode" != "on" ]; then
        if [ "$quorumnode" == "no" ]; then # this is not a member of the quorum, safe to reboot
            /usr/bin/scontrol reboot_nodes "$node"
            echo "$node rebooting"
            operation="$operation node rebooted - "
        else
            echo "$node is a quorum node - don't reboot"
            supdate "$node" drain "SH: OOMquorumnode"
            operation="$operation quorum node that needs a reboot - "
        fi
    else
        # dry run mode: report only, nothing is rebooted
        echo "$node - services not being restarted as script has been run in dry run mode"
        operation="$operation node rebooted - "
    fi
}
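The function above relies on a `quorumnode` value of "yes" or "no" being set before it runs. The slides do not show how that is decided, so the following is only one possible sketch: a lookup against a hard-coded list. The `quorum_nodes` array, the `is_quorum_node` helper, and the node names are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the quorum list, helper name and node names are hypothetical.

# A real site would presumably get this list from its own configuration.
quorum_nodes=("storage01" "storage02" "storage03")

# Print "yes" if the node is in the quorum list, otherwise "no".
function is_quorum_node {
    local node=$1 member
    for member in "${quorum_nodes[@]}"; do
        if [ "$member" = "$node" ]; then
            echo "yes"
            return 0
        fi
    done
    echo "no"
}

quorumnode=$(is_quorum_node "compute042")
echo "compute042 quorum member: $quorumnode"

quorumnode=$(is_quorum_node "storage02")
echo "storage02 quorum member: $quorumnode"
```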
Comments / Observations
At small scale anyone can SRE, (not necessarily well though).
Some of the poor design decisions made along the way
– Writing it in bash! Particularly 686 lines of it. That’s just silly.
Some of the pitfalls of having a self healing cluster
– Sometimes you actually want your cluster to be off, so having a tool that automatically turns it back on may not be optimal!
Some possible future developments
– Re-write in Python
– Write as a finite state machine?
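To give a feel for what the finite state machine idea could mean here, a small sketch follows. The states, events, and transitions are invented for illustration and do not come from the existing script; it is kept in bash only to match the code shown earlier.

```shell
#!/usr/bin/env bash
# Sketch only: the states, events and transitions are invented for illustration.

# Print the next state for a given current state and event.
# Unknown combinations leave the state unchanged.
function next_state {
    local state=$1 event=$2
    case "$state:$event" in
        healthy:check_failed)   echo "suspect" ;;
        suspect:check_failed)   echo "rebooting" ;;
        suspect:check_passed)   echo "healthy" ;;
        rebooting:check_passed) echo "healthy" ;;
        rebooting:check_failed) echo "needs_human" ;;
        *)                      echo "$state" ;;
    esac
}

# Walk one node through a sequence of failed checks and report each transition.
state="healthy"
for event in check_failed check_failed check_failed; do
    new_state=$(next_state "$state" "$event")
    echo "$state --$event--> $new_state"
    state=$new_state
done
```

In this sketch a node that reaches `needs_human` stays there whatever the checks report, so the tool stops acting on it until someone intervenes. Making every allowed transition explicit is the main appeal over a long chain of if statements.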