using approximate data for small, insightful analytics (ben kornmeier, protectwise) | cassandra...

Upload: datastax

Post on 16-Apr-2017

Using approximate data structures for small, insightful analytics.

Ben Kornmeier, Engineer

©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.

About ProtectWise

● A cloud security platform that aims to make threats obvious and actionable.

● Cuts down on the “noise” a network creates and shows only the most important details.

● Puts a strong emphasis on real-time data.

● Ingests and processes terabytes of data a day.

Goals Of Count Sumula

● Quick report generation.

● Support high cardinality data.

● Compute averages, min, and max.

● Easy to add additional aggregations.

Challenge: Daily Data Ingestion

● 2 billion netflow updates.

● 20 TB of raw network traffic ingested.

● 150 million observations generated.

Challenge: Costs of Processing Data

● Traditional batch processing is accurate, but slow.

○ We want results in seconds, not hours or days.

● Compute resources are very expensive at our scale.

Challenge: Making a Great User Experience

● A user should expect:

○ Hardly any waiting for a report to generate.

○ Up-to-date reports.

○ Meaningful reports that are actionable and concise.

○ Reports that are persisted forever and can be recombined after the fact to gain additional insights.

Some Use Cases

● Show me a count of all the hosts that had a threat on them in the past year.

● Show me the hosts with the most threats encountered over the course of a year.

Use Cases Examined

● Show me a count of all the hosts that had a threat on them in the past year.

○ IP addresses have very high cardinality: about 340 undecillion for IPv6 (2^128).

■ Or: 340,282,366,920,938,463,463,374,607,431,768,211,456 (WOW!)

○ Storage costs could be high.

Use Cases Examined Continued

● Show me the hosts with the most threats encountered over the course of a year.

○ Once again, high cardinality.

○ Same storage costs as the previous example, but now we also have to sort, which costs O(n log n).

Considerations For Our Solution

● Be real time.

● Must not grow without bound.

● Data must be around for decades or more.

● Be able to return queries for large time ranges.

● Be actionable and concise.

The Realization

● In general, users can live with an approximate result!

○ Approximate results use less space.

○ Can be computed in memory.

○ The size of an approximate result can be bounded by trading accuracy for space.

○ Approximate results are fast enough to compute in real time.

○ Meets two of our goals.

Some Approximations We Used

● HyperLogLog

● Count Min Sketch

● Stream Summary

● Bloom Filter

● Layered Bloom Filter

● Compound Approximations
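The deck lists Stream Summary without describing it. A minimal sketch follows, assuming the name refers to the bounded counter structure used by the Space-Saving top-k algorithm; the deck does not confirm which implementation was used. The class name `SpaceSaving`, its `capacity` parameter, and the host addresses are hypothetical. The point is that the "hosts with the most threats" question can be answered while tracking a fixed number of counters rather than sorting every host.

```python
class SpaceSaving:
    """Top-k frequency tracking with a fixed number of counters (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity  # maximum number of items tracked at once
        self.counts = {}          # item -> estimated count (may overcount, never undercounts)

    def add(self, item, n=1):
        if item in self.counts:
            self.counts[item] += n
        elif len(self.counts) < self.capacity:
            self.counts[item] = n
        else:
            # Evict the item with the smallest count; the newcomer inherits
            # that count, so a tracked item's estimate is never too low.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + n

    def top(self, k):
        """Return the k items with the largest estimated counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


summary = SpaceSaving(capacity=3)
for _ in range(50):
    summary.add("10.0.0.1")
for _ in range(30):
    summary.add("10.0.0.2")
for i in range(10):
    summary.add(f"192.168.0.{i}")  # ten hosts seen once each

print(summary.top(2))  # [('10.0.0.1', 50), ('10.0.0.2', 30)]
```

The linear scan for the smallest counter keeps the sketch short; a real Stream Summary keeps counters grouped by count so that eviction is constant time.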

HyperLogLog

● Only tracks the number of consecutive 0 bits in each element’s hash.

● Uses the longest run of consecutive 0 bits seen, and the probability of such a run occurring, to estimate the number of unique elements seen.

● Assumes a good hashing function (Murmur 3).

Example: HyperLogLog

Assuming our hashing function only returns 4 bits (16 combinations).

If our greatest number of consecutive 0 bits is 4, then we assume we have seen 16 elements (2^4).

If our greatest number of consecutive 0 bits is 3, then we assume we have seen 8 elements (2^3).

Cribbed from: http://blog.kiip.me/engineering/sketching-scaling-everyday-hyperloglog/

Bit pattern(s)                              Chance of occurrence
0000                                        1 / 16
1000, 0001                                  2 / 16 or 1 / 8
0011, 1001, 1100, 0100, 0010                5 / 16
0111, 1011, 1101, 1110, 1010, 0110, 0101    7 / 16
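The 4-bit example generalizes to a full estimator. The sketch below is a minimal HyperLogLog under stated assumptions, not ProtectWise's implementation: it uses SHA-1 from the Python standard library as a stand-in for Murmur 3, splits each 64-bit hash into a register index and a remainder, and counts leading 0 bits in the remainder (the usual formulation, rather than the longest run anywhere in the hash). The class name, the precision parameter `p`, and the `host-N` strings are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (hypothetical; SHA-1 stands in for Murmur 3)."""

    def __init__(self, p=12):
        self.p = p                   # number of hash bits used as the register index
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        index = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = leading 0 bits in the remainder, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)           # small-range correction
        return raw

    def merge(self, other):
        """Combine two sketches built with the same p; estimates the union."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"host-{i}")
print(round(hll.count()))  # an estimate close to 10000, not an exact count
```

Because `merge` only takes the per-register maximum, sketches stored per time interval can be recombined later to answer a query over a longer range.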

CountMinSketch

● Essentially a matrix.

● Inserts are duplicated across rows.

● Inserts are hashed differently per row.

● Counts can only be added to, never decremented.

● Used for frequency estimation.

● Can be used for averages, min, max as well.

Example: CountMinSketch

Inserting an element

After inserting “Ben”:

Row 1: 1 null null null null
Row 2: null null 1 null null

After inserting “Eric”:

Row 1: 1 null 1 null null
Row 2: null null 2 null null

Example: CountMinSketch Continued

Retrieving the count for “Ben”

Row 1: 1 null 1 null null (“Ben” hashes to the first cell: 1)
Row 2: null null 2 null null (“Ben” hashes to the third cell: 2)

Compare the values returned and take the minimum, in this case 1.
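The insert and retrieve steps above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the class name, the `width` and `depth` defaults, and the use of SHA-256 with the row number mixed in as the per-row hash are all assumptions made to keep the example self-contained.

```python
import hashlib


class CountMinSketch:
    """Minimal Count-Min Sketch (hypothetical sketch; sizes are illustrative)."""

    def __init__(self, width=1000, depth=4):
        self.width = width    # columns per row
        self.depth = depth    # number of rows, each with its own hash
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # Mixing the row number into the hash input gives each row a different hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        # Every insert is duplicated across all rows.
        for row, col in self._cells(item):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best estimate.
        return min(self.rows[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
sketch.add("Ben")
sketch.add("Eric")
print(sketch.estimate("Ben"))  # never less than the true count of 1
```

Wider rows reduce collisions and more rows reduce the chance that every row collides, which is the accuracy-for-space trade the deck describes.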

How Did We Store The Approximations?

● We generate enough approximations that we create about 1 GB of data each month.

○ Much better than the amount stored for full fidelity data.

● First approach: just use Redis.

● Second approach: Redis and Cassandra.

First Approach: Redis Only

Advantages

● Easy

● Fast

Disadvantages

● Ticking time bomb, since Redis keeps the whole data set in memory.

Second Approach: C* and Redis

Advantages

● C* scales horizontally, so storage can keep growing.

● Redis can be used when speed is important.

● Not a ticking time bomb.

Disadvantages

● Not as easy as the previous solution.

How We Use Redis With Cassandra

● Elements are placed in Redis and keyed on bucket name and time.

● Once an element from the next time interval is encountered, data is moved from Redis to Cassandra.

Incoming Updates
{"bucket": "observation", "time": 1, "value": 1}
{"bucket": "observation", "time": 1, "value": 2}
{"bucket": "observation", "time": 2, "value": 10}

Cassandra
(empty)

Redis
(empty)

Incoming Updates
{"bucket": "observation", "time": 1, "value": 2}
{"bucket": "observation", "time": 2, "value": 10}

Cassandra
(empty)

Redis
{"bucket": "observation", "time": 1, "value": 1}

Incoming Updates
{"bucket": "observation", "time": 2, "value": 10}

Cassandra
(empty)

Redis
{"bucket": "observation", "time": 1, "value": 1}
{"bucket": "observation", "time": 1, "value": 2}

Incoming Updates
{"bucket": "observation", "time": 2, "value": 10}

Cassandra
(empty)

Redis
{"bucket": "observation", "time": 1, "value": 1}
{"bucket": "observation", "time": 1, "value": 2}

Elements with the same bucket and time are summed.

Incoming Updates
{"bucket": "observation", "time": 2, "value": 10}

Cassandra
(empty)

Redis
{"bucket": "observation", "time": 1, "value": 3}

Incoming Updates
(empty)

Cassandra
(empty)

Redis
{"bucket": "observation", "time": 2, "value": 10}
{"bucket": "observation", "time": 1, "value": 3}

Incoming Updates
(empty)

Cassandra
{"bucket": "observation", "time": 1, "value": 3}

Redis
{"bucket": "observation", "time": 2, "value": 10}

The element from time 1 is determined to be expired and is written to Cassandra.
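The sequence above can be replayed with a small simulation. This is a sketch under loud assumptions: plain Python dictionaries stand in for Redis and Cassandra (no real client calls are made), the class name `BucketBuffer` is hypothetical, and integer values are summed where the real system would presumably merge serialized approximations.

```python
from collections import defaultdict


class BucketBuffer:
    """Sums updates in a hot store and flushes finished intervals to a cold store."""

    def __init__(self):
        self.hot = defaultdict(int)  # stand-in for Redis, keyed on (bucket, time)
        self.cold = {}               # stand-in for Cassandra, holds finalized values

    def update(self, event):
        bucket, time, value = event["bucket"], event["time"], event["value"]
        # Elements with the same bucket and time are summed in the hot store.
        self.hot[(bucket, time)] += value
        # An element from a later interval means earlier intervals are expired:
        # move them to the cold store in their finalized form.
        expired = [key for key in self.hot if key[0] == bucket and key[1] < time]
        for key in expired:
            self.cold[key] = self.hot.pop(key)


buffer = BucketBuffer()
for event in [
    {"bucket": "observation", "time": 1, "value": 1},
    {"bucket": "observation", "time": 1, "value": 2},
    {"bucket": "observation", "time": 2, "value": 10},
]:
    buffer.update(event)

print(dict(buffer.hot))  # {('observation', 2): 10}
print(buffer.cold)       # {('observation', 1): 3}
```

Each interval is written to the cold store once, after it stops changing, which is what keeps the hot store bounded.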

Cassandra Schema

CREATE TABLE buckets (
    name text,              // bucket name
    time_bucket timestamp,  // time floored on the next interval up
    time_unit int,          // {1: "minute", 2: "hour", 3: "day"}
    algorithm text,         // [HyperLogLog, CountMinSketch, etc.]
    time timestamp,         // the actual time
    d blob,                 // serialized data
    PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
);
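The schema comment says `time_bucket` is the time "floored on next interval up" but does not spell out the mapping. One possible reading, sketched here purely as an assumption: minute-level rows are partitioned by the hour, hour-level rows by the day, and day-level rows by the month. The function name and the day-to-month choice are hypothetical.

```python
from datetime import datetime

# time_unit codes taken from the schema comment
MINUTE, HOUR, DAY = 1, 2, 3


def time_bucket(ts, time_unit):
    """Floor ts to the start of the next larger interval (assumed mapping)."""
    if time_unit == MINUTE:
        return ts.replace(minute=0, second=0, microsecond=0)          # start of hour
    if time_unit == HOUR:
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)  # start of day
    if time_unit == DAY:
        # Assumption: days roll up into months; the deck does not say.
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown time_unit: {time_unit}")


ts = datetime(2016, 9, 7, 14, 35, 20)
print(time_bucket(ts, MINUTE))  # 2016-09-07 14:00:00
print(time_bucket(ts, HOUR))    # 2016-09-07 00:00:00
print(time_bucket(ts, DAY))     # 2016-09-01 00:00:00
```

Under this reading, all minute-level rows for one hour share a partition and are ordered by the `time` clustering column, so a time-range query touches a small, predictable set of partitions.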

Advantages of using Cassandra and Redis

● Elements are written in their finalized form to Cassandra.

○ Compactor friendly.

● Updates happen very fast, since Redis is fast.

● Redis memory use is now bounded.

Caveats

● Approximations are just that: approximate.

● Takes time to understand how they work.

● Tuning requires up-front knowledge of usage.

We’re Hiring!

Especially if you’re in Denver!

https://www.protectwise.com/careers.html