ALT SRE, Sean McGrath, 8th November 2018, Research IT, Trinity College Dublin, [email protected]


Post on 11-Oct-2020



ALT SRE

Sean McGrath

8th November 2018

Research IT, Trinity College Dublin, [email protected]


SRE Site/System Reliability Engineering

Typical SRE activities fall into some of the following categories [1]:

– Software engineering

- E.g. automation scripts, ... adding service features for scalability and reliability, or modifying infrastructure to make it more robust.

– Systems engineering

- E.g. Configuring production systems.

– Toil

- The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

[1] https://landing.google.com/sre/book/chapters/eliminating-toil.html


What does "ALT" mean?

– Short for alternative

– Alternative Site Reliability Engineering (ALT SRE) is like the SRE Google does, but not as good

– It’s much worse actually!


What’s a Science HPC Cluster?

Clusters have lots of servers that can sometimes fail.

(Often because the end user does cunning things to break them).


Clusters in Trinity [2]

[2] https://www.tchpc.tcd.ie/resources/clusters


Clusters in Trinity


That seems like a lot of servers that could break from time to time


What my mornings used to look like

1. Check for what's broken

2. Perform an action from a small list of common actions on the nodes based on what the queue management software tells you. E.g.

2.1 Reboot the node to clear out-of-memory errors

2.2 Restart certain services

2.3 Etc.

Repeat until all the nodes are back healthy or I get distracted.
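The "small list of common actions" in the routine above could be written down as a lookup from the reported problem to a remedy. This is a minimal sketch only: the function name choose_action, the reason strings and the action names are all illustrative assumptions, not taken from the talk or from self-heal.sh.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from a reported node problem to a remedial action.
# Reason strings and action names are assumptions made for illustration.
choose_action() {
    local reason=$1
    case "$reason" in
        *"out of memory"*|*OOM*) echo "reboot" ;;
        *"not responding"*)      echo "power-cycle" ;;
        *epilog*)                echo "restart-services" ;;
        *)                       echo "escalate" ;;   # anything unknown goes to a human
    esac
}

choose_action "Node out of memory"
choose_action "unknown fan fault"
```

Keeping the fallback as "escalate" means anything the table does not recognise is left for a person rather than guessed at.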

A lot of work that was

– Manual

– Repetitive

– Automatable

– Tactical

– Scaled linearly as the service grew


Seems like Toil


The solution?


What does ALT SRE look like in TCD?

– An in-house developed automation tool (i.e. a script),

– That ties together the common tools (sometimes just as simple as 'service blah restart' on a server, or remotely power cycling it),

– Run periodically to get the clusters to heal themselves of common problems and free up staff for other, higher-order tasks.

– https://github.com/smcgrat/linux-general/blob/master/self-heal.sh
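"Run periodically" usually means a scheduler such as cron, and a periodic job benefits from a guard so that a slow run cannot overlap the next one. The sketch below is an assumption about how that might look, not a description of how self-heal.sh is actually deployed. The cron interval, the install path, the lock location and the helper name run_exclusively are all hypothetical.

```shell
#!/usr/bin/env bash
# Assumed cron entry (path and interval are illustrative, not from the talk):
#   */15 * * * * /usr/local/sbin/self-heal.sh

# Run a command only if no other instance holds the lock directory.
# mkdir either creates the directory or fails, so it works as a simple lock.
run_exclusively() {
    local lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run is in progress, skipping"
        return 0
    fi
    "$@"
    local status=$?        # exit status of the wrapped command
    rmdir "$lockdir"
    return "$status"
}

lock="${TMPDIR:-/tmp}/self-heal-demo.lock.$$"
run_exclusively "$lock" echo "healing pass complete"
```

A lock directory is used here because it needs nothing beyond a POSIX shell. A stale lock left behind by a crashed run would have to be cleared by hand.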


Examples of what is looked for

declare -a nodes_with_problems

declare -a nodes_down_with_HC_errors

declare -a nodes_that_need_cluster_tests_run

declare -a nodes_that_are_draining

declare -a nodes_not_responding

declare -a nodes_with_epilog_errors
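One way arrays like these might be filled is by reading "name state reason" lines and sorting each node into a bucket. The sketch below assumes that input shape (for example from something like `sinfo -h -N -o "%N %T %E"` on a Slurm cluster) and uses canned sample lines instead of a live scheduler. The function name classify_nodes and the matching rules are illustrative assumptions. The only Slurm convention relied on is that a trailing `*` on a node state marks a node that is not responding.

```shell
#!/usr/bin/env bash
declare -a nodes_that_are_draining=()
declare -a nodes_not_responding=()
declare -a nodes_with_epilog_errors=()

# Sort "name state reason..." lines read from stdin into the arrays above.
classify_nodes() {
    local name state reason
    while read -r name state reason; do
        [ -n "$name" ] || continue
        case "$state" in
            drain*) nodes_that_are_draining+=("$name") ;;
        esac
        case "$state" in
            *\*) nodes_not_responding+=("$name") ;;   # literal trailing asterisk
        esac
        case "$reason" in
            *[Ee]pilog*) nodes_with_epilog_errors+=("$name") ;;
        esac
    done
}

# Sample input standing in for live scheduler output (made-up node names).
classify_nodes <<'EOF'
node01 draining Epilog error
node02 down* Not responding
node03 idle none
EOF

echo "draining: ${nodes_that_are_draining[*]}"
echo "not responding: ${nodes_not_responding[*]}"
echo "epilog errors: ${nodes_with_epilog_errors[*]}"
```

Feeding the function from a here-document rather than a pipe keeps the loop in the current shell, so the arrays are still populated after it returns.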


Example of what it might do

function restart_node {
    local node=$1
    if [ "$dryrunmode" != "on" ]; then
        if [ "$quorumnode" == "no" ]; then # this is not a member of the quorum, safe to reboot
            /usr/bin/scontrol reboot_nodes $node
            echo "$node rebooting"
            operation="$operation node rebooted - "
        else
            echo "$node is a quorum node - don't reboot"
            supdate $node drain "SH: OOMquorumnode"
            operation="$operation quorum node that needs a reboot - "
        fi
    else
        echo "$node - services not being restarted as script has been run in dry run mode"
        operation="$operation node rebooted - "
    fi
}
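The dry-run check in this function is a pattern that could be factored into a small helper, so that every destructive command goes through the same gate. This is a sketch under assumptions: the helper name run_or_print is hypothetical, and only the dryrunmode variable and the scontrol command line come from the slide.

```shell
#!/usr/bin/env bash
# Default to dry-run so the sketch never touches a real cluster.
dryrunmode="${dryrunmode:-on}"

# Run the given command, or only print it when dryrunmode is "on".
run_or_print() {
    if [ "$dryrunmode" = "on" ]; then
        echo "dry run: $*"
    else
        "$@"
    fi
}

run_or_print /usr/bin/scontrol reboot_nodes node01
```

Defaulting to dry-run is a deliberate choice in this sketch: the operator has to opt in before anything is rebooted.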


Comments / Observations

At small scale anyone can SRE (not necessarily well, though).

Some of the poor design decisions made along the way

– Writing it in bash! Particularly 686 lines of it. That's just silly.

Some of the pitfalls of having a self healing cluster

– Sometimes you actually want your cluster to be off, so having a tool that automatically turns it back on may not be optimal!
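One possible guard against this pitfall is a flag file that an operator creates before planned downtime and that the healing script checks before doing anything. This is a sketch of that idea only. The flag path and the function name healing_allowed are assumptions, and nothing in the talk says self-heal.sh works this way.

```shell
#!/usr/bin/env bash
# Hypothetical flag file; an operator would create it before planned downtime.
MAINTENANCE_FLAG="${MAINTENANCE_FLAG:-/etc/self-heal.disabled}"

# Succeeds only when the maintenance flag is absent.
healing_allowed() {
    [ ! -e "$MAINTENANCE_FLAG" ]
}

if healing_allowed; then
    echo "no maintenance flag, healing may proceed"
else
    echo "maintenance flag present, leaving the cluster alone"
fi
```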

Some possible future developments

– Re-write in Python

– Write as a finite state machine?
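To make the finite state machine idea concrete, here is a minimal sketch of a per-node transition table. The states (healthy, suspect, rebooting, drained), the events and the function name next_state are all illustrative assumptions, not a design taken from the talk. It is written in bash only to match the rest of the deck.

```shell
#!/usr/bin/env bash
# Print the next state for a (state, event) pair.
# Unknown pairs leave the state unchanged, so "drained" stays drained
# until a human intervenes.
next_state() {
    local state=$1 event=$2
    case "$state:$event" in
        healthy:check_failed)    echo "suspect" ;;
        suspect:check_passed)    echo "healthy" ;;
        suspect:check_failed)    echo "rebooting" ;;
        rebooting:check_passed)  echo "healthy" ;;
        rebooting:check_failed)  echo "drained" ;;
        *)                       echo "$state" ;;
    esac
}

# Walk one node through a made-up sequence of health-check results.
state="healthy"
for event in check_failed check_failed check_passed; do
    state=$(next_state "$state" "$event")
    echo "$event -> $state"
done
```

Listing the transitions explicitly makes it visible which situations the tool will act on and which it will leave alone.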


Thank You!

Questions?
