SaltConf14 - Thomas Jackson, LinkedIn - Safety with Power Tools

Site Reliability Engineering ©2013 LinkedIn Corporation. All Rights Reserved. Safety with power tools Learnings about Salt @ LinkedIn

Post on 17-Oct-2014

DESCRIPTION

As infrastructure scales, simple tasks become increasingly difficult. For large infrastructures to be manageable, we use automation. But automation, like any power tool, comes with its own set of risks and challenges. Automation should be handled like production code, and great care should be exercised with power tools. This talk will cover how SaltStack is used at LinkedIn and offer tips and tricks for automating management with SaltStack at massive scale including a look at LinkedIn-inspired Salt features such as blacklist and prereq states. It will also cover Salt master and minion instrumentation and a compilation of how not to use Salt.

TRANSCRIPT

Safety with power tools
Learnings about Salt @ LinkedIn

Who’s this guy?

What is SRE?

Hybrid of operations and engineering
Heavily involved in architecture and design
Application support ninjas
Masters of automation

So, what do I do with salt?

Heavy user
Active developer
Administrator (less so)

What’s LinkedIn?

Professional social network
You probably all have an account
You probably all get email from us too

Salt @ LinkedIn

When LinkedIn started
– Aug 2011: Salt 0.8.9
– ~5k minions

When I got involved
– May 2012: Salt 0.9.9
– ~10k minions

Today
– Now: 2014.01
– ~30k minions

How should you manage a service?

That’s not much of an answer…

Depends on use
– Home
– School
– Hack
– Work

How you manage the service changes over time
– Make it work: very manual, and it takes a long time to get working (more of a work of art…)
– Reproducibly make it work
– Script it out
– And more?

Apache Traffic Server

ATS: Apache Traffic Server

Fast, scalable and extensible HTTP/1.1 compliant caching proxy server
Non-blocking IO
Plugin architecture

This is the real logo

Example: ATS deployment @ LinkedIn

When I started, deployment was less than ideal:
– Check into SVN
– SCP files to hosts
– Manually remove host from rotation
– Replace files and install RPMs
– Restart trafficserver
– Check some logs to see if it's broken
– Put it in rotation and hope you didn’t miss anything

Example: ATS deployment @ LinkedIn

So many steps!
– Manual config management
– Manual rpm deployment
– Manual * (<- seriously, you name it!)

Works for a while, but doesn’t scale
Very VERY error prone

Solution? Automation with Salt!

Pillars, runners, and modules, Oh My!
States make this dead simple

Obligatory SLS formulas

    ats:
      pkg:
        - installed
        - pkgs:
          - trafficserver: x.x.x-xx
          - trafficserver-plugin-header-rewrite: x.x.x-x
          # ... (there are lots)
      service:
        - name: trafficserver
        - running

    /etc/trafficserver/records.config:
      file.managed:
        - makedirs: True
        - user: nobody
        - group: nobody
        - mode: 600
        - source: http://repo/ats/records.config
        - source_hash: md5=20d90b82bb3a4f95d7f17d1be6257246

Great, SLS– like I wasn’t going to see those @ SaltConf

Had to, sorry!

What is Salt?

What is Salt @ LinkedIn?

Remote execution
– salt \* cmd.run date -s "`date`" (leap-pocalypse anyone?)

“Catchall” deployment system
– ATS
– Couchbase
– Etc.

Automation platform
– Remote execution behind LinkedIn’s new standardized deployment
– Cache copy + torrent-style file distribution (in migration to Salt!)

So what’s this about power tools?

Growing up, my dad and I did a lot of cabinetry work
In the old days you did all this by hand
There are actually quite a few similarities

Learning to be a carpenter

When learning anything, you start with the basics and move up
– Calculator-less math classes anyone?

Carpentry 101: learn the basic tools
– Hand saws
– Sandpaper
– Hammer

Learning to be a carpenter

As a kid I always thought it was ridiculous to use these since I could *see* the power tools my dad was using

With more experience you can use more tools, once you know how to use the ones you have
– Tools need to be respected and used properly
– Some tools aren’t worth learning the hard way (chainsaws!)

So, SaltConf is about carpentry??

Well, not so much

Computers have lots of different tools
– ssh
– scp
– Package managers
– Etc.

As we scale it’s no longer practical to use all these manual tools, so we use power tools (automation)

How should you use Salt?

Understand the problem
Learn the tool
Test the solution
Watch for the result

How should you use Salt: Understand the problem

“If you can't explain it simply, you don't understand it well enough.”– Albert Einstein

What are you trying to automate?
– Is this full stack? Or just the application?
– What is already automated?
– Should it be automated?

Learn how to do it without the tooling
– Knowing how to do the deploy manually will help you when you need to debug

How should you use Salt: Learn the tool

“99% of the time you don’t have to write modules to use salt”
– *Most* things you want to do can be done with existing code
– If you find something that you think needs new code, reach out to the community: someone else probably wants it too!

Learn what it can and can’t do
Keep up with new features, both those just released and those coming up
Continually train yourself and your users

Little things can add up:
– In your __virtual__ function, check your dependencies (~5 lines x ~30K minions)

How should you use Salt: Test the Solution

Don’t be that guy

How should you use Salt: Test the Solution

Fact: “AUTOMATION IS CODE!”

It is common to set up extensive tests for code, but less so for automation
In many ways automation testing is just as important, if not more so!
– This applies to SLS formulas, modules, runners, AND salt itself.
– Staging is production for infrastructure!

How should you use Salt: Test the Solution

How do we do this @ LinkedIn?
– Code reviews
– VM environment: a pre-staging environment for testing
– Stress tests: pathological test cases
– Canary process: careful code rollouts

How should you use Salt: Watch for the result

Once we’ve tested our automation, we need to verify that it does what we expect.
– Code can sometimes have unintended consequences

Innocent enough right?

    @_withJMXConnection
    def domains(connection):
        '''
        returns a list of domains available
        '''
        domains = list(connection.getDomains())
        domains.sort()
        return domains

Wait, what’s that decorator?

See the problem?

    class _withJMXConnection(object):
        connection = None

        def __init__(self, fn, url):
            self.fn = fn
            if not _withJMXConnection.connection:
                # set up a jmx connection
                ...
                jpype.startJVM("libjvm.so",
                               "-Dcom.sun.management.jmxremote.authenticate=false",
                               "-Xms20m", "-Xmx20m")
                jmxurl = jpype.javax.management.remote.JMXServiceURL(url)
                jmxsoc = jpype.javax.management.remote.JMXConnectorFactory.connect(jmxurl)
                _withJMXConnection.connection = jmxsoc.getMBeanServerConnection()
            self.connection = _withJMXConnection.connection

Spins up a JVM!

How should you use Salt: Watch it

Once we’ve tested our automation, we need to verify that it does what we expect.
– Code can sometimes have unintended consequences

What metrics do we watch?
– CPU (load and utilization)
– Memory (real AND virtual)
– TCP sessions (and overflows!)
– Event bus (MasterEvent and MinionEvent)
– Etc.

Now everything is AWESOME!!!

NOPE! Still can have problems

Problems @ scale

Timeouts that didn’t work
– (#3431) original implementation relied on the zmq poller timeout, which you never hit if the event bus was relatively busy

salt-master memory leaks (all gone now)
– Zeromq3
– Reaping master child processes which crash

Performance problems on master (we’ve dropped CPU usage by ~80%)
– Change max open files check to not run per minion request
– Don't load minion modules every pillar call

Slow yumpkg5 module
– Went from 20s -> 60s! Now down to ~9s (for 55 packages)
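The timeout bug above is a general pattern: a per-poll timeout only fires when the bus is quiet, so on a busy bus the wait never ends. A common remedy is to track an overall deadline and shrink each poll to the time remaining. The sketch below illustrates that idea only; it is not Salt's actual fix, and `poll`, `wanted` and `clock` are hypothetical parameters.

```python
import time


def wait_for_event(poll, wanted, timeout, clock=time.monotonic):
    """Wait up to `timeout` seconds in total for `wanted`, however busy the bus is.

    `poll(max_wait)` is assumed to return the next event, or None if nothing
    arrived within `max_wait` seconds.
    """
    deadline = clock() + timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return None              # overall deadline reached
        event = poll(remaining)      # never wait longer than what is left
        if event == wanted:
            return event
        # unrelated event (or None): loop and re-check the deadline
```

With a per-poll timeout alone, a steady stream of unrelated events would reset the wait on every iteration; here each unrelated event still eats into the same fixed budget.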

Other features we’ve added

yumpkg
– support for specific versions (back in the day)
– major performance enhancements to the yumpkg module

Compound matchers (range & minion data)
Prereq state
Client_acl_blacklist
Check and set (cas) to the data module
depends decorator
iterative file hashing in fileclient
hash cache for fileserver + hash cache reaping
limit memory consumption on module load in *nix
kwarg passing with types
Profiler within master process
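To illustrate the "iterative file hashing" item: hashing a file in fixed-size chunks keeps memory use flat regardless of file size. The following is a generic sketch using Python's `hashlib`, not the fileclient code itself; the function name, default algorithm and chunk size are assumptions.

```python
import hashlib


def hash_stream(stream, algorithm="sha256", chunk_size=65536):
    """Hash a binary file-like object chunk by chunk instead of reading it whole."""
    digest = hashlib.new(algorithm)
    # iter() with a sentinel keeps calling read() until it returns b"" (EOF)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

For a file on disk this would be called as `hash_stream(open(path, "rb"))`; the result is identical to hashing the whole content at once.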

client_acl_blacklist (new in 0.13.0)

Salt had support for whitelisting, and per-user access control
Wanted to blacklist certain modules/users
– No root (require sudo)
– No cmd module (protect against fat-fingering)

    client_acl_blacklist:
      users:
        - root
        - '^(?!sudo_).*$'  # all non sudo users
      modules:
        - cmd
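The user pattern `^(?!sudo_).*$` uses a negative lookahead: it matches any name that does not start with `sudo_`. The sketch below shows how such a blacklist could be evaluated. It is an illustration of the regex semantics, not the Salt master's implementation, and `is_blocked` is a hypothetical helper.

```python
import re

# Same entries as the config on the slide.
BLACKLIST = {
    "users": ["root", r"^(?!sudo_).*$"],
    "modules": ["cmd"],
}


def is_blocked(user, fun, blacklist=BLACKLIST):
    """Return True if the user or the module part of `fun` matches the blacklist."""
    user_blocked = any(re.match(p, user) for p in blacklist.get("users", []))
    module = fun.split(".", 1)[0]  # "cmd.run" -> "cmd"
    module_blocked = any(re.match(p, module) for p in blacklist.get("modules", []))
    return user_blocked or module_blocked
```

Under this sketch, `sudo_alice` may run `test.ping` but not `cmd.run`, while `alice` and `root` are rejected outright.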

Prereq state (new in 0.16.0)

Came up as we started migrating our deployments to salt states
Motivation was to take hosts out of rotation before deployment
This feature lets us remove our own custom wrappers!

    graceful-down:
      cmd.run:
        - name: service apache graceful
        - prereq:
          - file: site-code

    site-code:
      file.recurse:
        - name: /opt/site_code
        - source: salt://site/code
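The ordering that prereq gives you can be thought of as: dry-run the target state, and only if it would change something, run the prereq state first and then the target. The sketch below models that control flow with plain callables. It is a simplification for illustration, not how the Salt state compiler is written, and all three parameters are hypothetical.

```python
def apply_with_prereq(target_would_change, prereq_action, target_action):
    """Run prereq_action before target_action, but only if the target would change.

    Returns the results of the actions that ran, in execution order.
    """
    executed = []
    if target_would_change():              # dry run of the target state
        executed.append(prereq_action())   # e.g. take the host out of rotation
        executed.append(target_action())   # e.g. deploy the new site code
    return executed
```

In the slide's example, `graceful-down` plays the role of `prereq_action` and `site-code` the role of `target_action`: if the code on disk is already current, the service is left alone.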

Kwarg passing with types

Found while trying to pass a pillar as a kwarg to a module (p.s. don’t)
Kwargs were cast as strings and passed as an arg
– Fine if the __str__ representation == yaml
– Problem if the __str__ representation != yaml

Put all kwargs in a single dict (marked as the kwarg dict) to maintain type
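A rough sketch of the "single marked dict" idea: rather than stringifying each kwarg, pack them all into one dict carrying a marker key, append it to the positional args, and unpack it on the receiving side. The marker name and the `pack`/`unpack` helpers here are assumptions for illustration, not Salt's wire format.

```python
KWARG_MARKER = "__kwarg__"  # marker key; the exact name is an assumption here


def pack(args, kwargs):
    """Append kwargs to the arg list as one marked dict so values keep their types."""
    packed = list(args)
    if kwargs:
        marked = dict(kwargs)
        marked[KWARG_MARKER] = True
        packed.append(marked)
    return packed


def unpack(packed):
    """Split a packed arg list back into positional args and typed kwargs."""
    args, kwargs = [], {}
    for item in packed:
        if isinstance(item, dict) and item.get(KWARG_MARKER) is True:
            kwargs.update({k: v for k, v in item.items() if k != KWARG_MARKER})
        else:
            args.append(item)
    return args, kwargs
```

A value such as `None` shows why this matters: `str(None)` is `"None"`, which YAML would read back as a string rather than a null, whereas the marked dict carries the original object through unchanged.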

Takeaways

Respect the tool!
– Understand the problem
– Learn the tool
– Test the solution
– Watch for the result

Be active in the community
Don’t just consume, Contribute!
Have FUN!

Got more questions about Salt @ LinkedIn?

Interested in how we manage Salt @ Scale?
– Breakout session with Craig Sebenik @ 11:15 am in Sundance

Got questions?
– Drop by our SaltConf booth!
– Connect with me on LinkedIn: www.linkedin.com/in/jacksontj
– Jacksontj on #salt on freenode
