Behind the Scenes at MySpace.com

Dan Farino, Chief Systems Architect
Friday, November 21, 2008

Topics

• Architecture overview and history
• The stuff I get to work on (in the Windows world)
    • Monitoring
    • Administration

Topics

• Windows?!
    • It’s a good server (now leave me alone.)
    • However, the selection of tools for large-scale management is a bit sparse...

Where we started 

Where we started

• The ideal growth scenario
    • Plan
    • Implement
    • Test
    • Go live
    • Monitor and collect ops data
    • Repeat

Where we started

• Our growth scenario:
    • Implement
    • Go live
• And while those are happening over and over:
    • Reboot servers
    • Throw hardware at performance issues
    • “Shotgun debugging”

Where we started

“Shotgun debugging”:

Shotgun debugging is a process of making relatively undirected changes to software in the hope that a bug will be perturbed out of existence.

Where we started

• Why would anyone “shotgun debug”?
    • Don’t really know how to analyze and debug a problem
    • Need to resolve the problem now, and collecting data for analysis would take too long

Where we started

• Web servers
    • Windows 2000 Server
    • IIS 5.0
    • ColdFusion 5
• Database servers
    • Windows 2000 Server
    • SQL Server 2000

Where we were

• Operationally
    • Batch files and robocopy for code deployment
    • “psexec” for remote admin script execution
    • Windows Performance Monitor for monitoring

Where we were

• Any sort of formal, automated QA process?
    • No.

Current architecture

Current architecture

• 4,500+ web servers
    • Windows 2003/IIS 6.0/ASP.NET
• 1,200+ “cache” servers
    • 64-bit Windows 2003
• 500+ database servers
    • 64-bit Windows 2003
    • SQL Server 2005

QA today

• Unit tests/automated testing
    • We still don’t “fuzz” the site nearly as thoroughly as our users do, though
• There are still problems that happen only in production

QA today

• We need better operational data collection so that we know what cases we’re not testing

Operational Data Collection

Ops Data Collection

• Two general types of systems:
    • Static
        • Collect, store and alert based on pre-configured rules
    • Dynamic
        • Write an ad-hoc script or application to collect data for an immediate or one-off need

Ops Data Collection

• Our current “static” Windows Performance counter monitor:

Ops Data Collection

• Cons of the static system:
    • Relatively central configuration managed by a small number of administrators
    • Bad for one-off requests: change the config, apply, wait for data
    • Developers’ questions usually go unanswered

Ops Data Collection

• Developers looking at production?!
    • Developers like to see their creations come to life (I know I do)
    • The more a developer can see once their code goes live, the more they’re going to know for V2

Ops Data Collection

• Cons of the dynamic system:
    • It’s not really a “system” at all... it’s an administrator running a script
    • It is a privileged operation: scripts are powerful and can potentially make changes to the system
    • Even when run as a limited user, bad scripts can still DoS the system

Ops Data Collection

• Cons of the dynamic system:
    • One-shot data collection is possible, but learning about deltas takes a lot more code (and polling, yuck)
    • Different custom data-collection tools that request the same data point cause duplicated network traffic
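
To make the "deltas take a lot more code" point concrete, here is a minimal Python sketch (not the speaker's tooling, which was script-based on Windows). The function name `diff_snapshots`, the host names and the values are all hypothetical. A one-shot collection is a single snapshot; learning about deltas means keeping the previous snapshot and comparing it against every new poll.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, float]  # hypothetical shape: host name -> counter value


def diff_snapshots(old: Snapshot, new: Snapshot) -> List[Tuple[str, str, float]]:
    """Compare two polled snapshots and list what changed between them."""
    events: List[Tuple[str, str, float]] = []
    for host in sorted(new.keys() - old.keys()):
        events.append(("added", host, new[host]))
    for host in sorted(old.keys() - new.keys()):
        events.append(("removed", host, old[host]))
    for host in sorted(old.keys() & new.keys()):
        if old[host] != new[host]:
            events.append(("changed", host, new[host]))
    return events


# Two consecutive polls of the same (made-up) set of hosts.
previous = {"web001": 12.0, "web002": 40.0}
current = {"web001": 12.0, "web002": 55.0, "web003": 7.0}
events = diff_snapshots(previous, current)
```

Every tool that wants deltas has to repeat this bookkeeping, and each one re-polls the servers on its own timer, which is where the duplicated network traffic comes from.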

Ops Data Collection

• A recent example of an ad-hoc task using our current “dynamic” system:

    get-adservers | run-agent ps /e '"Version: $(gcm F:\file.dll | %{$_.FileVersionInfo.FileVersion})"' | select Host, Message

Ops Data Collection

• Ideally, all operational data available in the entire server farm should be able to be queried:
    • Safely
    • Instantly
    • With change-notification

Ops Data Collection

• I’d like to be able to do something like this:

    SELECT CpuTime.*, ExceptionsPerSecond
    WHERE WebService.Status = 'UP'
      AND (serving = 'profile.myspace.com'
           OR serving = 'home.myspace.com')

Ops Data Collection

• I’d also like to be able to leave that query “hanging” and be notified of changes like:
    • A selected field has changed for a known data point
    • A new server has come online and meets the criteria (or vice versa)
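
One way to picture a “hanging” query is as a small state machine that remembers which servers currently match and what their selected fields last looked like. The Python sketch below is an illustration only; `StandingQuery`, its event names ("joined", "left", "changed") and the record fields are hypothetical and not taken from the platform described in this talk.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]


class StandingQuery:
    """Keeps a query 'hanging' and reports only what changed."""

    def __init__(self, predicate: Callable[[Record], bool], fields: List[str]):
        self.predicate = predicate
        self.fields = fields
        self.matches: Dict[str, Record] = {}  # host -> last selected values

    def update(self, host: str, record: Record) -> Optional[Tuple[str, str, Record]]:
        """Feed the latest record for one host; return an event or None."""
        selected = {f: record.get(f) for f in self.fields}
        was_match = host in self.matches
        if not self.predicate(record):
            if was_match:
                # The server no longer meets the criteria.
                return ("left", host, self.matches.pop(host))
            return None
        previous = self.matches.get(host)
        self.matches[host] = selected
        if not was_match:
            return ("joined", host, selected)   # newly meets the criteria
        if previous != selected:
            return ("changed", host, selected)  # a selected field changed
        return None
```

A caller would feed it records as they arrive, for example `query.update("web001", {"status": "UP", "cpu_time": 10})`, and act only on the non-`None` results.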

Our new operational data collection platform

Ops Data Collection

• Our new operational data-subscription platform:
    • On-demand
    • Supports both “one-shot” and “persistent” modes
    • Can be used by non-privileged users

Ops Data Collection

• Our new operational data-subscription platform:
    • Eliminates the need for the consumer to poll for changes
    • If a data source requires polling, that operation is pushed as close to the source as possible
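
As a rough illustration of pushing polling toward the source, the Python sketch below polls a local value and reports it only when it differs from the last reported value, so unchanged samples never cross the network. `LocalPoller` and its `read` callable are hypothetical stand-ins; the talk does not describe how the Agent implements this.

```python
from typing import Callable, Optional


class LocalPoller:
    """Runs next to the data source and emits a value only when it changes."""

    def __init__(self, read: Callable[[], float]):
        self._read = read                  # hypothetical local read, e.g. a counter
        self._last: Optional[float] = None
        self._primed = False               # False until the first sample is sent

    def poll(self) -> Optional[float]:
        value = self._read()
        if self._primed and value == self._last:
            return None                    # unchanged: nothing to send upstream
        self._primed = True
        self._last = value
        return value
```

The consumer never polls. It receives the first sample and then only the changes.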

Ops Data Collection

• A Client makes one TCP connection to a “Collector” server
    • Can receive data related to thousands of servers via this one connection
    • As long as the connection is up, the client is kept up-to-date

Ops Data Collection

• A little bit like:
    • Having all of the servers in a chat room and being able to talk to a selected subset of them at any time (over one connection)
• Initial idea came from looking at using XMPP+ejabberd for command and control

Ops Data Collection

[Diagram: Clients connect to a Collector Server, which connects to the Agents]

• Collector Server to Agents: one lazily-established TCP connection per Agent
• Clients to Collector Server: preferably one TCP connection per Client, although more than one is allowed (but frowned upon)
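
“Lazily established” can be read as: the Collector opens a connection to an Agent only when some Client first asks about that host, then reuses it. The Python sketch below shows that idea with an injected `dial` function standing in for the real TCP connect. The class and method names are hypothetical; the actual Collector is a C# Windows Service.

```python
from typing import Callable, Dict


class Collector:
    """Routes client requests to agents, dialing each agent on first use."""

    def __init__(self, dial: Callable[[str], object]):
        self._dial = dial                    # hypothetical: opens a connection to a host
        self._agents: Dict[str, object] = {}

    def agent_for(self, host: str) -> object:
        # Nothing is opened until a client actually asks about this host;
        # after that, the same connection is reused for every request.
        if host not in self._agents:
            self._agents[host] = self._dial(host)
        return self._agents[host]

    @property
    def open_connections(self) -> int:
        return len(self._agents)
```

With thousands of hosts in the farm, this keeps the number of open Agent connections proportional to what Clients are actually looking at.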

Ops Data Collection

• Provides:
    • Windows Performance Counters
    • WMI objects
        • Event logs
        • Hardware data
        • Custom WMI objects published from out-of-process
    • Log file contents

Ops Data Collection

• Provides:
    • On Linux, plans are to hook into something like D-Bus so that processes can provide operational data to the Agent in a loosely-connected manner

Ops Data Collection

• The Collector service:
    • A Windows Service in C#
    • Completely async I/O (never blocks a thread)
    • Uses Microsoft’s “Concurrency and Coordination Runtime”
• An Agent running on each host

Ops Data Collection

• Wire protocol is Google’s Protocol Buffers
    • Clients and Agents can be easily written in any of the languages for which there is a PB implementation
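
The talk does not say how messages are delimited on the TCP stream. One common approach, shown here purely as an assumption, is to prefix each serialized message with its length encoded as a base-128 varint, the same integer encoding Protocol Buffers uses internally. In this Python sketch the payloads are opaque bytes standing in for serialized messages, and the names `frame` and `split_frames` are hypothetical.

```python
from typing import List, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, next_position); raises IndexError if truncated."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def frame(payload: bytes) -> bytes:
    """Prefix one serialized message with its varint-encoded length."""
    return encode_varint(len(payload)) + payload


def split_frames(stream: bytes) -> List[bytes]:
    """Split a buffer of complete frames back into individual payloads."""
    frames = []
    pos = 0
    while pos < len(stream):
        length, pos = decode_varint(stream, pos)
        frames.append(stream[pos:pos + length])
        pos += length
    return frames
```

Because the framing is only bytes and integers, a Client or Agent in any language with a Protocol Buffers implementation could read the same stream.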

Ops Data Collection

• Why not use XMPP+ejabberd?
    • Wanted to use Protocol Buffers instead of XML
    • Wanted lazily-established TCP connections to the Agents
    • Wanted to see if C#+CCR could handle the load (yes it can)

Why develop a whole new platform?

Ops Data Collection

• Why develop something new?
    • There doesn’t seem to be anything out there right now that fits the need
    • And my requirements also include free and open source...

Ops Data Collection

• To do it properly, you really need to be using 100% async I/O.
• Libraries that make this easy are relatively new
    • CCR, Twisted, GTask, Erlang
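
To show what “100% async I/O” buys, here is a small sketch using Python’s `asyncio`, which is not one of the libraries named above and is swapped in only for illustration. `read_counter` is a hypothetical stand-in for a network read from one Agent, and the values it returns are fabricated.

```python
import asyncio
from typing import Dict, List


async def read_counter(host: str, delay: float) -> float:
    """Stand-in for a non-blocking network read from one agent."""
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking a thread
    return float(len(host))     # fabricated value, for illustration only


async def collect(hosts: List[str]) -> Dict[str, float]:
    # All reads are in flight at once on a single thread.
    values = await asyncio.gather(*(read_counter(h, 0.01) for h in hosts))
    return dict(zip(hosts, values))


results = asyncio.run(collect([f"web{i:03d}" for i in range(1000)]))
```

A thousand waits overlap on one thread, so the total time stays close to a single delay rather than a thousand of them, and no thread sits blocked per connection.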

Ops Data Collection

• Most established products were written before the multi-core/async craze

Ops Data Collection

• What does it enable?
    • The individual that is actually interested in the data can gather it himself
    • No central config, no need to involve an administrator
    • This includes developers

Ops Data Collection

• What does it enable?
    • There is a very low “barrier to entry”
    • It’s almost like exploring a database with some ad-hoc SQL queries
    • “I wonder...” questions are easily answered without a lot of work

Ops Data Collection

• What does it enable?
    • Charting/alerting/data-archiving systems no longer concern themselves with the data-collection intricacies.
    • We can spend time writing the valuable code instead of rewriting the same plumbing every time

Ops Data Collection

• What does it provide?
    • Abstracts the physical server farm from the user
    • If you know machine names, great. But you can also say “all servers serving ‘profile.myspace.com’” or “all cache servers in Los Angeles”
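
The abstraction can be pictured as resolving a logical selector to machine names at query time. The Python sketch below uses a made-up inventory; the `Server` fields, the host names and the second location are hypothetical and only mirror the two example selectors quoted above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Server:
    name: str
    role: str
    serving: str
    location: str


# Hypothetical inventory, for illustration only.
FARM = [
    Server("web001", "web", "profile.myspace.com", "Los Angeles"),
    Server("web002", "web", "home.myspace.com", "Los Angeles"),
    Server("cache001", "cache", "", "Los Angeles"),
    Server("cache002", "cache", "", "Elsewhere"),
]


def resolve(farm: List[Server], selector: Callable[[Server], bool]) -> List[str]:
    """Turn a logical selector into the machine names it currently covers."""
    return [s.name for s in farm if selector(s)]


profile_hosts = resolve(FARM, lambda s: s.serving == "profile.myspace.com")
la_caches = resolve(FARM, lambda s: s.role == "cache" and s.location == "Los Angeles")
```

The user works with selectors, and the mapping to physical machines can change underneath without the query changing.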

Ops Data Collection

• What does it provide?
    • Guaranteed to keep you “up-to-date”
    • Get your initial set of data and then just wait for the deltas
    • Pushes polling as close to the source as possible

Ops Data Collection

• What does it provide?
    • Eliminates duplicate requests
        • Hundreds of clients can be monitoring the “% Processor Time” for a server and it will only be sent from that server once when it changes

Ops Data Collection

• What does it provide?
    • Only collects data that someone is currently asking for
    • This is how we avoid having explicit configuration on the server
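
Collecting only what someone is asking for, and sending each value once however many clients want it, can both be sketched with subscriptions tracked per data point. This Python example is an assumption about the general technique, not the platform’s code; `SubscriptionHub` and the "start"/"stop" commands are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str]  # (host, data point), e.g. ("web001", "% Processor Time")
Callback = Callable[[float], None]


class SubscriptionHub:
    """Tracks who wants which data point and tells agents when to collect."""

    def __init__(self) -> None:
        self._subscribers: Dict[Key, List[Callback]] = {}
        # Stand-in for messages the hub would send to agents.
        self.agent_commands: List[Tuple[str, Key]] = []

    def subscribe(self, key: Key, callback: Callback) -> None:
        if key not in self._subscribers:
            self._subscribers[key] = []
            self.agent_commands.append(("start", key))  # first interested client
        self._subscribers[key].append(callback)

    def unsubscribe(self, key: Key, callback: Callback) -> None:
        callbacks = self._subscribers[key]
        callbacks.remove(callback)
        if not callbacks:
            del self._subscribers[key]
            self.agent_commands.append(("stop", key))   # nobody is asking any more

    def on_agent_update(self, key: Key, value: float) -> None:
        # The agent sent the value once; fan it out to every subscriber.
        for callback in list(self._subscribers.get(key, [])):
            callback(value)
```

The set of active subscriptions is the only configuration: collection starts with the first subscriber and stops with the last.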

Ops Data Collection

• Is this really a good way to do things?
    • Having too much data pushed at you is a bad thing
    • Being able to pull from a large selection of data points is a good thing

Ops Data Collection

• For developers, knowing that they will have access to instrumentation data even in production encourages more detailed instrumentation

Ops Data Collection

• Better instrumentation = more data available = more detailed feedback to QA and developers

Ops Data Collection

• Ease-of-use is a very high priority
• Easy and fun APIs encourage adoption

Ops Data Collection

• LINQ via C#:

    var collector = new Collector(...);

    // Assumes server.WindowsPerfCounter is a sequence of counters
    var counters =
        from server in collector
        where server.subdomain == "www.myspace.com"
        from counter in server.WindowsPerfCounter
        where counter.category == "Processor"
        select new { server.Name, counter.Instance, counter.Value };

Ops Data Collection

• LINQ via C# and CLINQ (“Continuous LINQ”) = instant monitoring app (in about 10 lines of code):

    var counters = ...

    MainWpfWindow.MainGrid = counters;

    // Go grab a beer

Ops Data Collection

• Tail a file across thousands of servers
    • With the filtering expression being run on the remote machines
    • At the same time as someone else is (with no duplicate lines being sent over the network)
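
A rough Python sketch of the agent side of such a tail follows. It assumes the filtering expression is a regular expression, which the talk does not specify, and `agent_tail` is a hypothetical name. Each matching line is emitted once per distinct filter, however many clients asked with that same filter, and the fan-out to those clients would happen at the Collector.

```python
import re
from typing import Iterable, List, Tuple


def agent_tail(lines: Iterable[str], patterns: List[str]) -> List[Tuple[str, str]]:
    """Runs on the remote machine: filter locally, emit (pattern, line) once
    per distinct pattern regardless of how many clients requested it."""
    compiled = {p: re.compile(p) for p in set(patterns)}  # duplicates collapse here
    sent = []
    for line in lines:
        for pattern in sorted(compiled):
            if compiled[pattern].search(line):
                sent.append((pattern, line))
    return sent


# Two clients asking for the same filter produce one copy of each match.
log = ["GET /home 200", "GET /profile 500", "POST /login 500"]
sent = agent_tail(log, [" 500$", " 500$"])
```

Lines that match no filter never leave the machine, which is what makes tailing across thousands of servers affordable.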

Ops Data Collection

• Open source?
    • Hopefully!
• Other implementations?
    • I may write a GTask or Erlang version as a “weekend” project

Thank you!
