t digest-update


Upload: ted-dunning

Post on 22-Jan-2018

Category:

Data & Analytics


TRANSCRIPT

Page 1: T digest-update

© 2017 MapR Technologies 1

Update on t-digest

Page 2: T digest-update

Contact Information

Ted Dunning, PhD

Chief Application Architect, MapR Technologies

Board member, Apache Software Foundation

O’Reilly author

Twitter @ted_dunning

Page 3: T digest-update

Who We Are

• MapR Technologies

– We make a kick-ass platform for big data computing

– Support many workloads including Hadoop / Spark / HPC / Other

– Extended to allow streams and tables in basic platform

– Free for academic research / training

• Apache Software Foundation

– Culture hub for building open source communities

– Shared values around openness for contribution as well as use

– Many major projects are part of Apache

– Even more minor ones!

Page 4: T digest-update

Basic Outline

• Why we should measure distributions

• Basic Ideas

• How t-digest works

• Recent results

• Applications

Page 7: T digest-update

Why Is This Practically Important

• The novice came to the master and said “something is broken”

• The master replied “What has changed?”

• And the student was enlightened

Page 8: T digest-update

Finding change is key

but what kind?

Page 9: T digest-update

Last Night’s Latencies

• These are ping latencies from my hotel

• Looks pretty good, right?

• But what about longer term?

208.302 198.571 185.099 191.258 201.392 214.738 197.389 187.749 201.693 186.762 185.296 186.390 183.960 188.060 190.763

> mean(y$t[i])
[1] 198.6047
> sd(y$t[i])
[1] 71.43965

Page 10: T digest-update

Not So Fast …

Page 12: T digest-update

This is long-tailed land

You have to know the distribution of values

Page 13: T digest-update

Page 14: T digest-update

A single number

is simply not enough

Page 15: T digest-update

What We Really Need Here

• I want to be able to compute the distribution from any time period

• From any subset of measurements

• With lots of keys and filters

• And not a lot of space

• Basically, any OLAP kind of query

select distribution(x) from … where … group by y,z

Page 16: T digest-update

Idea 0 – Pre-defined bins

• So let’s assume we have bins

– Upper, lower bound, constant width

• Get a measurement, pick a bin, increment count

• Works great if you know the data

– And you have limited dynamic range (otherwise too many bins)

– And the distribution is fixed

• Useful, but not general enough
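
As a rough illustration of Idea 0, here is a minimal Python sketch of fixed-width binning. The function name `fixed_bin_counts`, its parameters, and the 1500.0 outlier are made up for this example; the other sample values are taken from the ping latencies shown earlier.

```python
def fixed_bin_counts(values, lower, upper, n_bins):
    """Count values into n_bins equal-width bins spanning [lower, upper)."""
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    outside = 0  # measurements that fall outside the pre-defined range
    for x in values:
        if lower <= x < upper:
            # Guard against rounding pushing the index to n_bins.
            idx = min(int((x - lower) / width), n_bins - 1)
            counts[idx] += 1
        else:
            outside += 1
    return counts, outside


latencies = [208.302, 198.571, 185.099, 191.258, 201.392, 1500.0]
counts, outside = fixed_bin_counts(latencies, 180.0, 220.0, 4)
```

The single long-tail value lands outside every bin, which is the weakness the slide points at: the bounds have to be known in advance.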

Page 17: T digest-update

Idea 1 – Exponential Bins

• Suppose we want relative accuracy in measurement space

• Latencies are positive and only matter within a few percent

– 1.1 ms versus 1.0 ms

– 1100 ms versus 1000 ms

• We can cheat by using floating point representations

– Compute bin using magic

– Count
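
One way to picture exponential bins, before any floating point tricks, is to take the logarithm of the measurement. The sketch below is an illustration only: the names `log_bin_index` and `log_bin_bounds` and the 5% relative width are assumptions, not something the slides specify.

```python
import math


def log_bin_index(x, rel_width=0.05):
    """Index of the exponential bin holding x; each bin spans a factor of (1 + rel_width)."""
    if x <= 0:
        raise ValueError("exponential bins need positive measurements")
    return math.floor(math.log(x) / math.log1p(rel_width))


def log_bin_bounds(index, rel_width=0.05):
    """Lower and upper edge of a bin; the width is always rel_width times the lower edge."""
    lower = (1 + rel_width) ** index
    return lower, lower * (1 + rel_width)


counts = {}
for latency_ms in [1.0, 1.1, 1000.0, 1100.0]:
    idx = log_bin_index(latency_ms)
    counts[idx] = counts.get(idx, 0) + 1
```

With these settings a 10% difference separates two measurements into different bins whether they are near 1 ms or near 1000 ms, which is the relative accuracy the slide asks for.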

Page 19: T digest-update

FloatHistogram

• Assume all measurements are in the range

• Divide this range into power of 2 sub-ranges

• Sub-divide each sub-range evenly with steps

– is typical

• Relative error is bounded in measurement space

• Bin index can be computed using FP representation!
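
The idea of reading the bin index straight out of the floating point representation can be sketched in Python as follows. This is not the library's FloatHistogram code. The function name `fp_bin_index`, the lower bound `x_min=1.0`, and the choice of 3 mantissa bits (8 steps per power-of-2 range) are assumptions picked for illustration.

```python
import struct


def fp_bin_index(x, x_min=1.0, bits_per_octave=3):
    """Bin index for x >= x_min > 0.

    Each power-of-2 range is split evenly into 2**bits_per_octave steps.
    The index comes from the exponent plus the leading mantissa bits of
    the IEEE 754 double, so no logarithm is needed.
    """
    def top_bits(v):
        # Reinterpret the 64-bit double as an integer; for positive
        # doubles this integer grows monotonically with the value.
        raw = struct.unpack(">q", struct.pack(">d", v))[0]
        # Drop all but the exponent and the leading mantissa bits.
        return raw >> (52 - bits_per_octave)

    return top_bits(x) - top_bits(x_min)


indexes = [fp_bin_index(v) for v in (1.0, 1.5, 2.0, 3.0)]
```

Here 1.0 and 2.0 are eight bins apart, as are 2.0 and 4.0, so the bin width doubles with each power of 2 and the relative error stays bounded.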

Page 20: T digest-update

Fixed Size Bins

Page 21: T digest-update

Approximate Exponential Bins

Page 22: T digest-update

Non-linear bins are better

(sometimes)

Still not general enough

Page 23: T digest-update

Idea 2 – Fully Adaptive Bins

• First intuition – in general, we want accuracy in terms of percentile

• Second intuition – we want better accuracy at extreme quantiles

– 50%-ile versus 50.1%-ile?

– What does 0.1% error even mean for the 99.99th percentile?

• We need bins with small counts near the edges

Page 24: T digest-update

First 1% of data shown.

Left graph has 100 x 100 sample bins.

Right graph has ~130 bins, variable size

Page 25: T digest-update

The Basic t-digest

• Take a bunch of data

• Sort it

• Group into bins

– But make the bins be smaller at the beginning and end

• Remember the centroid and count of each bin

• That’s a t-digest
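
The steps above can be sketched in a few lines of Python. The slide does not say how "smaller at the beginning and end" is enforced. This sketch assumes the arcsine scale function k(q) = δ/(2π) · asin(2q − 1) from the published t-digest description and lets a bin grow only while its k-size stays at most 1. The names `k_scale`, `build_digest`, and `delta` are illustrative, not the library's API.

```python
import math


def k_scale(q, delta):
    """Arcsine scale function: steep near q=0 and q=1, so bins there stay small."""
    return delta / (2 * math.pi) * math.asin(2 * q - 1)


def build_digest(data, delta=100):
    """Sort the data, then greedily group it into (centroid, count) bins."""
    xs = sorted(data)
    n = len(xs)
    bins = []
    seen = 0                 # points already placed in closed bins
    total, count = 0.0, 0    # running sum and size of the open bin
    for x in xs:
        q_left = seen / n
        q_right = (seen + count + 1) / n
        # Close the open bin if adding x would push its k-size past 1.
        if count > 0 and k_scale(q_right, delta) - k_scale(q_left, delta) > 1:
            bins.append((total / count, count))
            seen += count
            total, count = 0.0, 0
        total += x
        count += 1
    if count:
        bins.append((total / count, count))
    return bins


digest = build_digest(range(10000))
```

Only the centroid and count of each bin are kept, so the result is far smaller than the data, and the bins at both ends hold far fewer points than those in the middle.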

Page 26: T digest-update

But Wait, You Need a Bit More

• Take a bunch of new data, old t-digest

• Sort the data and the old bins together

• Group into bins

– Note that existing bins have bigger weights

– So they might survive … or might clump

• Remember the centroid and count of each new bin

• That’s an updated t-digest

Page 27: T digest-update

Oh … and Merging

• Take a bunch of old t-digests

• Sort the bins

• Group into mega-bins

– Respect the size constraint

• Remember the centroid and count of each new bin

• That’s a merged t-digest
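
Merging can be sketched the same way, treating every old bin as a weighted point. As before, this is a sketch under assumptions: the size constraint is taken to be a k-size of at most 1 under the arcsine scale function from the published t-digest description, and `k_scale`, `merge_digests`, and `delta` are hypothetical names rather than the library's API.

```python
import math


def k_scale(q, delta):
    """Arcsine scale function (an assumption; not stated on the slide)."""
    return delta / (2 * math.pi) * math.asin(2 * q - 1)


def merge_digests(digests, delta=100):
    """Merge digests, each a list of (centroid, count) bins, into one digest."""
    bins = sorted(b for d in digests for b in d)
    n = sum(count for _, count in bins)
    merged = []
    seen = 0                  # weight already placed in closed mega-bins
    mean, weight = 0.0, 0     # centroid and weight of the open mega-bin
    for c, w in bins:
        q_left = seen / n
        q_right = min(1.0, (seen + weight + w) / n)
        # Respect the size constraint: close the mega-bin if it would grow too big.
        if weight > 0 and k_scale(q_right, delta) - k_scale(q_left, delta) > 1:
            merged.append((mean, weight))
            seen += weight
            mean, weight = 0.0, 0
        weight += w
        mean += (c - mean) * w / weight   # weighted running mean
    if weight:
        merged.append((mean, weight))
    return merged


first = merge_digests([[(float(x), 1) for x in range(5000)]])
second = merge_digests([[(float(x), 1) for x in range(5000, 10000)]])
combined = merge_digests([first, second])
```

Raw measurements are just bins with a count of 1, so the same function also covers the update case: merge the old digest with a list of `(x, 1)` pairs.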

Page 28: T digest-update

Adaptive non-linear bins are good

and general

And can be grouped

and regrouped

Page 29: T digest-update

Results

Page 30: T digest-update

Page 32: T digest-update

Status

• Current release (3.x)

– Small accuracy bugs in corner cases

– Best overall is still AVLTreeDigest

• Upcoming release (4.0)

– Better accuracy in pathological cases

– Strictly bounded size

– No dynamic allocation (with MergingDigest)

– Good speed (100ns for MergingDigest, 5ns for FloatHistogram)

– Real Soon Now

Page 33: T digest-update

Example Application

• The data:

– ~1 million machines

– Even more services

– Each producing thousands of measurements per second

• Store t-digest for each 5 minute period for each measurement

• Want to query any combination of keys, produce t-digest result

“what was the distribution of launch times yesterday?”

“what about last month?”

“in Europe versus in North America versus in Asia?”

Page 34: T digest-update

Collect Data

[Diagram: one data center containing web servers, their web-server logs, log-stash instances, a log consolidator, and a log_events stream]

Page 35: T digest-update

And Transport to Global Analytics

[Diagram: one data center (web servers, web-server logs, log-stash, log consolidator, log_events) and GHQ (log_events, events, Elaborate, events (log-stash), Aggregate, Signal detection)]

Page 36: T digest-update

With Many Sources

[Diagram: one data center (web servers, web-server logs, log-stash, log consolidator, log_events) and GHQ (log_events, events, Elaborate, events (log-stash), Aggregate, Signal detection)]

Page 37: T digest-update

With Many Sources

[Diagram: two data centers (each with web servers, web-server logs, log-stash, log consolidator, log_events) and GHQ (log_events, events, Elaborate, events (log-stash), Aggregate, Signal detection)]

Page 38: T digest-update

With Many Sources

[Diagram: three data centers (each with web servers, web-server logs, log-stash, log consolidator, log_events) and GHQ (log_events, events, Elaborate, events (log-stash), Aggregate, Signal detection)]

Page 39: T digest-update

What about visualization?

Page 40: T digest-update

Can’t see small count bars

Page 41: T digest-update

Good Results

Page 42: T digest-update

Bad Results – 1% of measurements are 3x bigger

Page 43: T digest-update

Bad Results – 1% of measurements are 3x bigger

Page 44: T digest-update

With Better Vertical Scaling

Page 45: T digest-update

Uniform Bins

Page 46: T digest-update

FloatHistogram Bins

Page 47: T digest-update

With FloatHistogram

Page 48: T digest-update

Original Ping Latency Data

Page 49: T digest-update

Summary

• Single measurements insufficient, need distributions

• Uniform binned histograms not good

• FloatHistogram for some cases

• T-digest for general cases

• Upcoming release has super-fast and accurate versions

• Good visualization also key

[Plot: k versus q; q axis from 0.0 to 1.0, k axis from 0 to 10]

Page 50: T digest-update

Q & A

Page 52: T digest-update

T-digest

• Or we can talk about small errors in q

• Accumulate samples, sort, merge

• Merge if k-size < 1

• Interpolate using centroids in x

• Very good near extremes, no dynamic allocation
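
To make "interpolate using centroids in x" concrete, here is a simplified Python sketch of a quantile query against a list of (centroid, count) bins. It assumes each centroid sits at the middle of the ranks its bin covers and interpolates linearly between neighbours. The function name `quantile` and the toy bins are made up, and this sketch does nothing special at the extremes beyond returning the first or last centroid.

```python
def quantile(bins, q):
    """Estimate the q-quantile from sorted (centroid, count) bins."""
    n = sum(count for _, count in bins)
    target = q * n
    # Place each centroid at the middle of the ranks its bin covers.
    ranks = []
    seen = 0
    for _, count in bins:
        ranks.append(seen + count / 2)
        seen += count
    if target <= ranks[0]:
        return bins[0][0]
    if target >= ranks[-1]:
        return bins[-1][0]
    for i in range(1, len(bins)):
        if target <= ranks[i]:
            # Linear interpolation in x between the two neighbouring centroids.
            frac = (target - ranks[i - 1]) / (ranks[i] - ranks[i - 1])
            return bins[i - 1][0] + frac * (bins[i][0] - bins[i - 1][0])
    return bins[-1][0]


toy_bins = [(1.0, 2), (5.0, 6), (9.0, 2)]
median = quantile(toy_bins, 0.5)
```

Because the bins near q = 0 and q = 1 hold few points, the neighbouring centroids there are close together in rank, which is why the interpolation is most accurate near the extremes.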

[Plot: k versus q; q axis from 0.0 to 1.0, k axis from 0 to 10]