©2016 ProtectWise, Inc. All rights reserved. Proprietary & Confidential.
About ProtectWise
● Cloud security platform that aims to make threats actionable and obvious.
● Cuts down on the “noise” a network creates and shows only the most important details.
● Puts a strong emphasis on real-time data.
● Ingests and processes terabytes of data a day.
Goals Of Count Sumula
● Quick report generation.
● Support high cardinality data.
● Compute averages, min, and max.
● Easy to add additional aggregations.
Challenge: Daily Data Ingestion
● 2 billion netflow updates.
● Ingests 20TB of raw network traffic.
● Generates 150 million observations.
Challenge: Costs of Processing Data.
● Traditional batch processing is accurate, but slow.
○ We want results in seconds not hours or days.
● Compute resources are very expensive at our scale.
Challenge: Making a Great User Experience
● A user should expect:
○ Hardly any waiting for report generation.
○ Up to date reports.
○ Meaningful reports that are actionable and concise.
○ Reports that are persisted forever and can be recombined after the fact to gain additional insights.
Some Use Cases
● Show me a count of all the hosts that had a threat on them in the past year.
● Show me the hosts with the most threats encountered over the course of a year.
Use Cases Examined
● Show me a count of all the hosts that had a threat on them in the past year.
○ IP addresses have very high cardinality: about 340 undecillion for IPv6.
■ Or: 340,282,366,920,938,463,463,374,607,431,768,211,456 (WOW!)
○ Storage costs could be high.
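To put that number in perspective, here is a rough back-of-the-envelope sketch. The 16 bytes per address is the raw size of an IPv6 address; the host count of 100 million is a made-up illustration, not a ProtectWise figure.

```python
# Size of the IPv6 address space: addresses are 128 bits wide.
IPV6_ADDRESS_SPACE = 2 ** 128


def exact_set_bytes(distinct_hosts: int, bytes_per_address: int = 16) -> int:
    """Lower bound on storing every distinct address exactly.

    Ignores index, replication, and per-row overhead, so real costs are higher.
    """
    return distinct_hosts * bytes_per_address


if __name__ == "__main__":
    print(f"{IPV6_ADDRESS_SPACE:,} possible IPv6 addresses")
    # Hypothetical number of distinct hosts seen in a year.
    print(exact_set_bytes(100_000_000) / 1e9, "GB just for the raw addresses")
```

An exact answer grows linearly with the number of distinct hosts, which is the cost the approximations later in the deck avoid.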
Use Cases Examined Continued
● Show me the hosts with the most threats encountered over the course of a year.
○ Once again, high cardinality.
○ Same storage costs as the previous example, but now we also have to sort, which costs O(n log n).
Considerations For Our Solution
● Be real time.
● Must not grow without bounds.
● Data must be around for decades or more.
● Be able to return queries for large time ranges.
● Be actionable and concise.
The Realization
● In general users can live with an approximate result!
○ Approximate results use less space.
○ Can be computed in memory.
○ The size of an approximation can be bounded by trading accuracy for space.
○ Approximate results are fast enough to compute in real time.
○ Meets two of our goals.
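The accuracy-for-space trade can be made concrete with the standard textbook sizing rules for two of the sketches discussed later. This is a minimal sketch; the function names are hypothetical, and the formulas are the usual published bounds rather than ProtectWise's actual tuning.

```python
import math
from typing import Tuple


def count_min_dimensions(epsilon: float, delta: float) -> Tuple[int, int]:
    """Width and depth of a Count-Min Sketch.

    With these dimensions an estimate overcounts by at most epsilon * N
    (N = total of all inserted counts) with probability at least 1 - delta.
    """
    width = math.ceil(math.e / epsilon)
    depth = math.ceil(math.log(1.0 / delta))
    return width, depth


def hyperloglog_std_error(registers: int) -> float:
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(registers)


if __name__ == "__main__":
    # Tighter error bounds cost more counters; the size never depends on
    # how many distinct elements are inserted.
    print(count_min_dimensions(epsilon=0.001, delta=0.01))
    print(hyperloglog_std_error(4096))
```

In both cases the memory footprint is fixed up front by the accuracy target, which is what keeps the structures from growing without bounds.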
Some Approximations We Used
● HyperLogLog
● Count Min Sketch
● Stream Summary
● Bloom Filter
● Layered Bloom Filter
● Compound Approximations
HyperLogLog
● Only tracks the greatest number of consecutive 0 bits seen in the hashed values.
● Uses the count of consecutive 0 bits and the probability of it occurring to determine an estimate of unique elements seen.
● Assumes a good hashing function (Murmur 3).
Example: HyperLogLog
Assume our hashing function returns only 4 bits (16 combinations).
If our greatest number of consecutive 0 bits is 4, we assume we have seen 16 elements.
If our greatest number of consecutive 0 bits is 3, we assume we have seen 8 elements.
Cribbed from: http://blog.kiip.me/engineering/sketching-scaling-everyday-hyperloglog/
Bit pattern(s)                              Chance of occurrence
0000                                        1 / 16
1000, 0001                                  2 / 16 (or 1 / 8)
0011, 1001, 1100, 0100, 0010                5 / 16
0111, 1011, 1101, 1110, 1010, 0110, 0101    7 / 16
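The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in production: it follows the standard HyperLogLog layout (the first p bits of the hash pick a register, and each register keeps the longest run of leading zeros seen in the remaining bits), uses BLAKE2 from the standard library as a stand-in for Murmur 3, and picks p = 12 arbitrarily.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    @staticmethod
    def _hash64(item: str) -> int:
        # Stand-in for Murmur 3: any well-mixed 64-bit hash works here.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item: str) -> None:
        h = self._hash64(item)
        rest_bits = 64 - self.p
        index = h >> rest_bits                   # first p bits: register index
        rest = h & ((1 << rest_bits) - 1)        # remaining bits
        # Rank = number of leading zeros in `rest`, plus one.
        rank = rest_bits - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        # Harmonic mean of 2^register across all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        if estimate <= 2.5 * self.m:
            # Small-range correction: fall back to linear counting.
            zeros = self.registers.count(0)
            if zeros:
                estimate = self.m * math.log(self.m / zeros)
        return estimate

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Two sketches combine by taking the per-register maximum.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(50_000):
        hll.add(f"host-{i}")
        hll.add(f"host-{i}")            # duplicates do not change the estimate
    print(round(hll.count()))
```

The `merge` method is what allows sketches stored per time interval to be recombined later into an estimate for a larger time range.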
CountMinSketch
● Essentially a matrix.
● Inserts are duplicated across rows.
● Inserts are hashed differently per row.
● Counts can only be added to, never removed.
● Used for frequency estimation.
● Can be used for averages, min, max as well.
Example: CountMinSketch
Inserting an element
After inserting “Ben”:
1 null null null null
null null 1 null null
After inserting “Eric”:
1 null 1 null null
null null 2 null null
Example: CountMinSketch Continued
Retrieving the count for “Ben”
1 null 1 null null
null null 2 null null
Compare the values returned (1 from the first row, 2 from the second) and take the min, in this case 1.
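The insert and lookup steps above can be sketched as follows. This is a minimal illustration under a few assumptions: counters start at 0 rather than null, each row gets its own hash by salting BLAKE2 with the row number (a stand-in for whatever per-row hashing is actually used), and the default 2 x 5 shape simply mirrors the slide.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width: int = 5, depth: int = 2):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.rows = [[0] * width for _ in range(depth)]

    def _column(self, row: int, item: str) -> int:
        # Salting with the row number gives each row a different hash.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "big"),
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Every insert touches exactly one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._column(row, item)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum across
        # rows is the estimate closest to the true count.
        return min(
            self.rows[row][self._column(row, item)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    cms = CountMinSketch()
    cms.add("Ben")
    cms.add("Eric")
    print(cms.estimate("Ben"))
```

Because counters only ever increase, an estimate is never lower than the true count; it can only be higher when another element collides in every row.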
How Did We Store The Approximations?
● We generate enough approximations that we create about 1 GB of data each month.
○ Much better than the amount stored for full fidelity data.
● First approach: just use Redis.
● Second approach: Redis and Cassandra.
First Approach: Redis Only
Advantages
● Easy
● Fast
Disadvantages
● Ticking time bomb since Redis is memory only.
Second Approach: C* and Redis
Advantages
● C* scales horizontally as the data grows.
● Redis can be used when speed is important.
● Not a ticking time bomb.
Disadvantages
● Not as easy as the previous solution.
How We Use Redis With Cassandra
● Elements are placed in Redis and keyed on bucket name and time.
● Once an element from the next time interval is encountered, the previous interval's data is moved from Redis to Cassandra.
Incoming Updates
{"bucket": "observation", "time": 1, "value": 1}
{"bucket": "observation", "time": 1, "value": 2}
{"bucket": "observation", "time": 2, "value": 10}
Cassandra
(empty)
Redis
(empty)
Incoming Updates
{"bucket": "observation", "time": 1, "value": 2}
{"bucket": "observation", "time": 2, "value": 10}
Cassandra
(empty)
Redis
{"bucket": "observation", "time": 1, "value": 1}
Incoming Updates
{"bucket": "observation", "time": 2, "value": 10}
Cassandra
(empty)
Redis
{"bucket": "observation", "time": 1, "value": 1}
{"bucket": "observation", "time": 1, "value": 2}
Incoming Updates
{"bucket": "observation", "time": 2, "value": 10}
Cassandra
(empty)
Redis
{"bucket": "observation", "time": 1, "value": 1}
{"bucket": "observation", "time": 1, "value": 2}
Elements are summed.
Incoming Updates
{"bucket": "observation", "time": 2, "value": 10}
Cassandra
(empty)
Redis
{"bucket": "observation", "time": 1, "value": 3}
Incoming Updates
(empty)
Cassandra
(empty)
Redis
{"bucket": "observation", "time": 2, "value": 10}
{"bucket": "observation", "time": 1, "value": 3}
Incoming Updates
(empty)
Cassandra
{"bucket": "observation", "time": 1, "value": 3}
Redis
{"bucket": "observation", "time": 2, "value": 10}
The element from time 1 is determined to be expired and is written to Cassandra.
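The walkthrough above can be simulated in memory. This is only a sketch of the flow: plain dictionaries stand in for Redis and Cassandra, the class and attribute names are hypothetical, values are simple sums rather than serialized sketches, and late-arriving updates for an interval that was already flushed are not handled.

```python
from collections import defaultdict


class RollupBuffer:
    def __init__(self):
        # Stand-in for Redis: (bucket, time) -> running sum for open intervals.
        self.hot = defaultdict(int)
        # Stand-in for Cassandra: (bucket, time) -> finalized value.
        self.cold = {}

    def update(self, bucket: str, time: int, value: int) -> None:
        # An update from a later interval marks earlier intervals of the
        # same bucket as expired, so they are written out in final form.
        expired = [
            key for key in self.hot
            if key[0] == bucket and key[1] < time
        ]
        for key in expired:
            self.cold[key] = self.hot.pop(key)
        # Updates within the open interval are summed in place.
        self.hot[(bucket, time)] += value


if __name__ == "__main__":
    buf = RollupBuffer()
    for update in [
        {"bucket": "observation", "time": 1, "value": 1},
        {"bucket": "observation", "time": 1, "value": 2},
        {"bucket": "observation", "time": 2, "value": 10},
    ]:
        buf.update(update["bucket"], update["time"], update["value"])
    print(dict(buf.hot))   # the open interval stays in the fast store
    print(buf.cold)        # the finished interval is written once
```

The sketch flushes the expired interval just before storing the new update, while the slides show the new update landing first; the end state is the same.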
Cassandra Schema
CREATE TABLE buckets (
    name text,              // bucket name
    time_bucket timestamp,  // time floored on the next interval up
    time_unit int,          // {1: "minute", 2: "hour", 3: "day"}
    algorithm text,         // [HyperLogLog, CountMinSketch, etc.]
    time timestamp,         // the actual time
    d blob,                 // serialized data
    PRIMARY KEY ((name, time_bucket, time_unit, algorithm), time)
);
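One possible reading of "time floored on the next interval up" is that minute-level rows share a partition per hour, hour-level rows per day, and day-level rows per month. The sketch below illustrates that reading only; the function name and the choice of month as the interval above day are assumptions, not something the schema confirms.

```python
from datetime import datetime

# time_unit codes taken from the schema comment.
MINUTE, HOUR, DAY = 1, 2, 3


def time_bucket(ts: datetime, time_unit: int) -> datetime:
    """Floor `ts` to the start of the interval one level above `time_unit`."""
    if time_unit == MINUTE:
        # Minute-level rows are grouped by hour.
        return ts.replace(minute=0, second=0, microsecond=0)
    if time_unit == HOUR:
        # Hour-level rows are grouped by day.
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if time_unit == DAY:
        # Assumption: day-level rows are grouped by month.
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown time_unit: {time_unit}")


if __name__ == "__main__":
    ts = datetime(2016, 5, 17, 13, 45, 30)
    print(time_bucket(ts, MINUTE))
    print(time_bucket(ts, HOUR))
    print(time_bucket(ts, DAY))
```

Under this reading, a query over a time range touches one partition per enclosing interval and reads the rows inside it in `time` order.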
Advantages of Using Cassandra and Redis
● Elements are written in their finalized form to Cassandra.
○ Compactor friendly.
● Updates are very fast because Redis is fast.
● Redis memory use is now bounded.
Caveats
● Approximations are just that: approximate.
● It takes time to understand how they work.
● Tuning needs up-front knowledge of usage.
We’re Hiring!
https://www.protectwise.com/careers.html
Especially if you’re in Denver!