it's all about telemetry

Post on 16-May-2015

Category:

Technology

DESCRIPTION

The ins and outs of monitoring your technology enabled business.

TRANSCRIPT

It’s all about telemetry

Monitoring what matters in a useful way.

Tuesday, June 26, 12

Theo Schlossnagle @postwait

I write software

I write books

I give talks

I participate in the industry

I speak frankly about industry issues

Data, data, everywhere.

A billion pageviews / month.

100k database queries / second.

1MM memcache queries / second.

500k MQ messages / second.

10MM I/O operations / second.

Most new big data problems are

solvable

Big Data

Most new big data problems are created by our solutions, and thus solvable despite their ROI

That’s a whole lot of data

Think in terms of logs (too many do)

About 26 trillion log lines / month

@ 40 bytes compressed: 1PB / month

Just because it is possible does not mean it will return on investment (and does not mean it won’t)
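
The slide does not say how the 26 trillion figure was derived. A rough sketch of the arithmetic, assuming it comes from logging every one of the 10MM I/O operations per second over a 30-day month, with 40 bytes per compressed line as stated above:

```python
# Back-of-the-envelope log volume.
# Assumption (not stated on the slide): the line count is driven by the
# 10MM I/O operations per second, logged once each, over a 30-day month.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # 2,592,000 seconds
IO_OPS_PER_SECOND = 10_000_000          # from the "Data, data, everywhere" slide
BYTES_PER_LINE = 40                     # compressed size quoted on the slide

log_lines_per_month = IO_OPS_PER_SECOND * SECONDS_PER_MONTH
bytes_per_month = log_lines_per_month * BYTES_PER_LINE
petabytes_per_month = bytes_per_month / 10**15  # decimal petabytes

print(f"{log_lines_per_month / 10**12:.1f} trillion lines / month")
print(f"{petabytes_per_month:.2f} PB / month")
```

Under those assumptions the numbers land close to the slide's: roughly 26 trillion lines and roughly 1PB per month.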

It’s all “useful”; which data?

Think in terms of cost/benefit.

Sure the data is useful, but it costs money to store

Does it cost you more to have it or not to have it?

Maybe the right approach is to keep that level of detail for a few days?

Double-edged sword.

Eroding granularity over time keeps storage under control

MISTAKE

1 year: at a glance

1 week: looks normalish

1 day: confidence of normalcy increases

1 week: that looks different

1 day: yup, that’s not at all like that other week
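
One way to see why eroding granularity can be a mistake: a short event that is obvious at one-minute resolution nearly vanishes once the data is averaged down. This is a minimal sketch with made-up numbers (a flat 10 ms latency with a ten-minute spike to 500 ms), not data from the graphs in the talk.

```python
from statistics import mean

# Hypothetical data: one day of per-minute latency samples, in milliseconds.
minute_latency_ms = [10.0] * (24 * 60)
for minute in range(600, 610):          # a ten-minute incident at 10:00
    minute_latency_ms[minute] = 500.0

# Erode granularity: average down to hourly and daily points.
hourly_avg_ms = [
    mean(minute_latency_ms[hour * 60:(hour + 1) * 60]) for hour in range(24)
]
daily_avg_ms = mean(minute_latency_ms)

print(f"worst minute:     {max(minute_latency_ms):.1f} ms")
print(f"worst hourly avg: {max(hourly_avg_ms):.1f} ms")
print(f"daily avg:        {daily_avg_ms:.1f} ms")
```

The incident is a 50x jump at minute resolution, roughly 9x in the hourly averages, and barely a bump in the daily point.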

Other methods

What do you store?

How do you store it?

Why is it useful?

Winning the cost benefit game by reducing costs more significantly than reducing benefits

Positive Value

Be in the green.

[Figure: Benefit and Cost plotted against monitoring activity ➠]

There’s a bigger picture

It’s not as easy as you think.

[Figure: Benefit and Cost plotted against monitoring activity ➠, over a wider range]

Value is difference, not area

Green can be misleading

[Figure: Benefit and Cost plotted against monitoring activity ➠]

Value = Benefit - Cost

Green means we have positive return

[Figure: Value plotted against monitoring activity ➠]

It’s not about return

Well, it’s not only about return

[Figure: Value plotted against monitoring activity ➠]

It’s about maximizing return

This is a bit like black magic

[Figure: Value plotted against monitoring activity ➠]
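
The curves in these graphs are not given numerically, so the following is only a sketch with invented shapes: a benefit curve with diminishing returns and a linear cost curve, both hypothetical. It shows the point of the last few slides, that the goal is the activity level where Benefit - Cost peaks, not merely any level where it is positive.

```python
import math

def benefit(activity: float) -> float:
    """Hypothetical benefit curve: diminishing returns as monitoring grows."""
    return 1.0 - math.exp(-2.0 * activity)

def cost(activity: float) -> float:
    """Hypothetical cost curve: linear in monitoring activity."""
    return 0.4 * activity

def value(activity: float) -> float:
    """Value = Benefit - Cost, as on the slide."""
    return benefit(activity) - cost(activity)

# Scan monitoring activity from 0 to 3 in steps of 0.01.
levels = [step / 100 for step in range(0, 301)]
positive_levels = [a for a in levels if value(a) > 0]
best_level = max(levels, key=value)

print(f"value is positive from {positive_levels[0]} to {positive_levels[-1]}")
print(f"value peaks near activity = {best_level}")
```

With real systems neither curve is known in closed form, which is presumably why the slide calls this a bit like black magic.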

Technique 1: text

Store changes
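
A minimal sketch of storing changes for a text metric. The `ChangeLog` class and the sample version strings are hypothetical; the idea is only that a value is written when it differs from the last stored one, and any point in time can still be answered from the stored changes.

```python
class ChangeLog:
    """Keeps a text metric as a list of (timestamp, value) change points."""

    def __init__(self):
        self.entries = []  # ordered by timestamp

    def observe(self, timestamp, value):
        """Record the value only if it differs from the last stored one."""
        if not self.entries or self.entries[-1][1] != value:
            self.entries.append((timestamp, value))
            return True
        return False

    def value_at(self, timestamp):
        """Return the value in effect at the given time, or None if unknown."""
        current = None
        for changed_at, value in self.entries:
            if changed_at > timestamp:
                break
            current = value
        return current


# Hypothetical observations: a version string polled once a minute.
log = ChangeLog()
for ts, version in [(0, "v1.0"), (60, "v1.0"), (120, "v1.1"),
                    (180, "v1.1"), (240, "v1.0")]:
    log.observe(ts, version)

print(log.entries)  # five observations, three stored changes
```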

Technique 2: numeric

Store rollups (i.e. statistical aggregates over fixed windows)

over 1 minute store

min/max/avg/stddev/covariance/50%/95%/99%

lots of information

heavy lossy compression of high-frequency data

loses population distribution information
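
A rough sketch of a one-minute rollup. The `rollup` and `percentile` helpers are illustrative names, percentiles use the simple nearest-rank method, and covariance is left out because it needs a second stream to compare against.

```python
from statistics import mean, pstdev

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]

def rollup(samples, window_seconds=60):
    """Aggregate (timestamp, value) samples into fixed windows.

    Timestamps are integer seconds; each window is keyed by its start time.
    """
    windows = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % window_seconds
        windows.setdefault(start, []).append(value)

    result = {}
    for start, values in sorted(windows.items()):
        values.sort()
        result[start] = {
            "count": len(values),
            "min": values[0],
            "max": values[-1],
            "avg": mean(values),
            "stddev": pstdev(values),
            "p50": percentile(values, 50),
            "p95": percentile(values, 95),
            "p99": percentile(values, 99),
        }
    return result


# Hypothetical input: one sample per second for two minutes.
samples = [(t, float(t % 60 + 1)) for t in range(120)]
rollups = rollup(samples)
print(rollups[0])
```

However many samples arrive in a window, the stored record stays the same size, which is the heavy lossy compression the slide refers to.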

Database replication

Lag (green) and rate of lag change (purple)

Storage Usage

We can see growth. More useful, we can use this to project.
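
The talk does not say which projection method sits behind the graph. As an illustration only, a plain least-squares line through daily usage samples is enough to estimate when a disk fills; the sample data and the 500 GB capacity below are invented.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def day_when_full(points, capacity):
    """Projected day on which usage reaches capacity, or None if not growing."""
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope


# Hypothetical data: (day, GB used), growing 2 GB per day from 100 GB.
usage = [(day, 100.0 + 2.0 * day) for day in range(30)]
slope, intercept = fit_line(usage)
full_on_day = day_when_full(usage, capacity=500.0)

print(f"growth: {slope:.2f} GB/day, full around day {full_on_day:.0f}")
```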

With simple numeric data

Unknowns can be predicted

In sane ways with confidence

Full Disclosure

You see awesome examples of predictive analytics

Like the real-world one on the previous slide

In practice, almost all data streams predict one thing:

they have no fucking clue.

Technique 3: histograms

Store histograms

over 1 minute store

counts of datapoints seen in various buckets

retains complete population distribution

loss of precision
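
A small sketch of the idea, assuming fixed-width buckets for simplicity (the talk does not describe the bucketing scheme actually used). Each value is reduced to the lower bound of its bucket, so individual values lose precision, but the shape of the whole population survives and percentiles can still be estimated afterwards.

```python
from collections import Counter

def histogram(values, bucket_width):
    """Count values per bucket; buckets are keyed by their lower bound."""
    counts = Counter()
    for value in values:
        counts[int(value // bucket_width) * bucket_width] += 1
    return counts

def approx_percentile(counts, pct):
    """Lower bound of the bucket containing the requested percentile."""
    total = sum(counts.values())
    target = (pct * total + 99) // 100  # nearest rank, integer ceiling
    seen = 0
    for lower_bound in sorted(counts):
        seen += counts[lower_bound]
        if seen >= target:
            return lower_bound
    return None


# Hypothetical service times in milliseconds seen during one minute.
service_times_ms = [3, 4, 12, 14, 15, 27, 95]
minute_histogram = histogram(service_times_ms, bucket_width=10)

print(dict(sorted(minute_histogram.items())))
print("p50 bucket:", approx_percentile(minute_histogram, 50))
print("p99 bucket:", approx_percentile(minute_histogram, 99))
```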

Histograms 101

This.

This is a histogram.

It shows the frequency of values within a population.

Height represents frequency

Now, height and color represent frequency

Now, only color represents frequency

Histograms ➠ time series

at a single time interval

API Service Times

We can see a full population shift of several milliseconds

Combining techniques

In our system (as a reference point)

Arbitrary numbers of numeric data points on a single stream occupy 32 bytes of space for statistical aggregates and occupy about 2k of space for a histogram

This means we can store these transforms on numeric data in perpetuity
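
To put those sizes in perspective, here is a rough per-stream estimate. It assumes one aggregate record and one histogram per 1-minute window (the window used on the technique slides) and reads "about 2k" as 2048 bytes; both are assumptions for the arithmetic, not figures from the talk.

```python
# Per-stream storage for a year of one-minute windows.
# Assumptions: one record of each kind per minute, "2k" taken as 2048 bytes.
MINUTES_PER_YEAR = 365 * 24 * 60        # 525,600 windows
AGGREGATE_BYTES = 32                    # statistical aggregates, from the slide
HISTOGRAM_BYTES = 2048                  # "about 2k" per histogram

aggregate_bytes_per_year = MINUTES_PER_YEAR * AGGREGATE_BYTES
histogram_bytes_per_year = MINUTES_PER_YEAR * HISTOGRAM_BYTES
total_bytes_per_year = aggregate_bytes_per_year + histogram_bytes_per_year

print(f"aggregates: {aggregate_bytes_per_year / 10**6:.1f} MB / stream / year")
print(f"histograms: {histogram_bytes_per_year / 10**9:.2f} GB / stream / year")
```

The key property is that the cost per window is fixed: it does not grow with the number of data points the stream produces.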

Text is a bit harder

You need to be careful

Some data sources can be constantly changing

Producing gobs of change data

You’re doing it wrong

Find these and fix them

Correlating Events

Change Management vs. Performance

What to monitor?

Most people don’t monitor the things that matter most

Monitor the Business

Financials:

Revenues. Costs. Margins. AR. Account delinquency.

Marketing:

Web analytics. Campaigns. Costs. Returns. Convergence.

Monitor the Support

Customer Service:

Problems. Time investment. Customer satisfaction. Resolution time.

Monitor the Engineering

Engineering:

Deployments. Test coverage. Bug reports. Bug fixes. Effort spent.

Operations:

Faults. Pages. Escalations. Provisioning time. Equipment defect rates. 3rd party failure rates.

Monitor the Service

Systems:

Networks. Systems. Storage.

Databases:

Performance. Error rates. Backups.

Middleware:

Herein lies the magic and room for awesomeness

Monitor the Middleware

Your systems are complex

Monitor their interactions

Messaging, APIs, etc.

Monitor all the things.

But, perhaps most importantly...

USE UNIFIED TOOLING

What we use...

reconnoiter

SNMP, nad, resmon, statsd, HTTP traps, jdbc, etc.

statsd (clients)

javascript beacons
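
As an illustration of what a statsd client does, here is a minimal sketch that formats a timing metric in the statsd line protocol (`name:value|ms`) and sends it as a single UDP datagram. The metric name, host, and helper function names are hypothetical; 8125 is the conventional statsd port.

```python
import socket

def statsd_timing(name: str, millis: int) -> bytes:
    """Format a timing metric in the statsd line protocol."""
    return f"{name}:{millis}|ms".encode("ascii")

def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: one UDP datagram, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical metric name for an API call that took 12 ms.
payload = statsd_timing("api.service_time", 12)
print(payload)
# send_metric(payload)  # uncomment to send to a statsd listener on localhost
```

Because the transport is UDP, instrumented code does not block or fail when the collector is down, which keeps the monitoring itself cheap.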

Middleware mix

API service times, traffic, user signup rates.

Thank you!
