Fast > Perfect: Practical real-time approximations using Spark Streaming
![Page 1: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/1.jpg)
Fast > Perfect
Practical real-time approximations using Spark Streaming
Kevin Schmidt (@kevinschmidtbiz)
Luis Vicente (@lvicentesanchez)
![Page 2: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/2.jpg)
A Bit of Context: Mind Candy
![Page 3: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/3.jpg)
A Bit of Context: Free To Play
• Sum Arbitrary Values
• Count Uniques
• It’s Complicated
![Page 4: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/4.jpg)
A Bit of Context: Setup
![Page 5: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/5.jpg)
A Bit of Context: Requirements
• Constant storage usage, independent of the number of users
• Handle delayed or duplicate data
• Error rate under 3%
![Page 6: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/6.jpg)
Counting Users: Basics
How To Count IDs Uniquely
![Page 7: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/7.jpg)
Counting Users: HyperLogLog
HyperLogLog
addIdentifier(value: String)
merge(other: HyperLogLog): HyperLogLog
zero(): HyperLogLog
countUniques(): Long
12-Bit Size: Error Rate = 1.6%, Fixed Size = 4KB
14-Bit Size: Error Rate = 0.9%, Fixed Size = 16KB
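To make the four operations on this slide concrete, here is a minimal HyperLogLog sketch in Scala. It is an illustration under stated assumptions, not the implementation used in the talk or its repository: the `precision` parameter, the choice of `MurmurHash3` as the hash function, the one-byte-per-register layout, and the mutable `addIdentifier` are all choices made for this example. With 12 index bits there are 4096 registers, which at one byte each gives the 4KB quoted above, and the usual standard-error estimate of about 1.04/√m gives roughly 1.6%.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical minimal HyperLogLog; `precision` is the number of index bits.
final class HyperLogLog(val precision: Int = 12) {
  require(precision >= 7 && precision <= 16, "precision must be in [7, 16]")

  private val m: Int = 1 << precision                 // number of registers
  private val registers: Array[Byte] = new Array[Byte](m)

  // Top `precision` bits of the hash pick a register; the position of the
  // first 1-bit in the remaining bits is the rank stored in that register.
  def addIdentifier(value: String): Unit = {
    val h = MurmurHash3.stringHash(value)
    val index = h >>> (32 - precision)
    val rest = h << precision
    val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - precision) + 1
    if (rank > registers(index)) registers(index) = rank.toByte
  }

  // Merging keeps the larger rank per register, so it is commutative and
  // idempotent: merging the same data twice changes nothing.
  def merge(other: HyperLogLog): HyperLogLog = {
    require(other.precision == precision, "precisions must match")
    val out = new HyperLogLog(precision)
    var i = 0
    while (i < m) {
      out.registers(i) = math.max(registers(i), other.registers(i)).toByte
      i += 1
    }
    out
  }

  // Raw HyperLogLog estimate, with the linear-counting correction for
  // small cardinalities.
  def countUniques(): Long = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val sum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * m * m / sum
    val zeros = registers.count(_ == 0)
    val estimate =
      if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros)
      else raw
    math.round(estimate)
  }
}

object HyperLogLog {
  // The empty sketch, i.e. the identity element for `merge`.
  def zero(precision: Int = 12): HyperLogLog = new HyperLogLog(precision)
}
```

Because `zero()` and `merge` form a monoid, sketches built on different partitions or in different batches can be combined in any order, which is what makes the structure a natural fit for a reduce step.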
![Page 8: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/8.jpg)
Counting Users: DStream
![Page 9: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/9.jpg)
Counting Users: RDD
![Page 10: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/10.jpg)
Counting Users: Transform
![Page 11: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/11.jpg)
Counting Users: Storing
![Page 12: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/12.jpg)
Counting Users: Some Scala
https://github.com/lvicentesanchez/fast-gt-perfect
![Page 13: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/13.jpg)
Counting Users: Adding Up
![Page 14: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/14.jpg)
Counting Users: Performance
![Page 15: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/15.jpg)
Counting Users: Result
• Constant storage for one day of data using 14-bit HyperLogLogs: 288 * 16KB = 4608KB
• HyperLogLogs count users only once even if data is duplicated or repeated
• Time bucketing ensures delayed data is counted correctly
• Difference of <1% between HyperLogLogs and real count
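The storage figure and the time-bucketing point above can be sketched in a few lines of Scala. The slides do not state the bucket width; 288 buckets per day implies 5-minute buckets, and that is the assumption made here. The object and method names are hypothetical.

```scala
object TimeBuckets {
  // Assumption: 5-minute buckets, which is what 288 buckets per day implies.
  val BucketMillis: Long = 5L * 60L * 1000L
  val BucketsPerDay: Int = (24L * 60L * 60L * 1000L / BucketMillis).toInt

  // Start of the bucket an event belongs to. The key is the event's own
  // timestamp, not its arrival time, so a delayed event still lands in
  // the bucket where it happened.
  def bucketStart(eventTimeMillis: Long): Long =
    eventTimeMillis - java.lang.Math.floorMod(eventTimeMillis, BucketMillis)

  // Storage for one day when every bucket holds one fixed-size sketch.
  def dailyStorageKB(sketchSizeKB: Double): Long =
    math.round(BucketsPerDay * sketchSizeKB)
}
```

For example, `TimeBuckets.dailyStorageKB(16.0)` reproduces the 4608KB above. An event with timestamp 12:03:10 that arrives at 12:20 is still merged into the 12:00 bucket, and since merging a HyperLogLog is idempotent, replaying the same event does not change that bucket's count.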
![Page 16: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/16.jpg)
Counting Revenue: Basics
How To Sum Arbitrary Values
![Page 17: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/17.jpg)
Counting Revenue: BloomFilter
BloomFilter
addIdentifier(value: String)
merge(other: BloomFilter): BloomFilter
zero(): BloomFilter
contains(value: String): Boolean
Configurable Size: Capacity = 10k, Error Rate = 1%, Size = 11.7KB
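The following is a minimal Bloom filter sketch in Scala that mirrors the operations listed on this slide and shows where the 11.7KB figure comes from. It is an illustration, not the implementation from the talk: the sizing uses the standard formulas m = −n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions, and the double-hashing scheme, the seeds, the `java.util.BitSet` storage, and the `sizeInBytes` helper are assumptions made for this example.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical minimal Bloom filter backed by a java.util.BitSet.
final class BloomFilter private (
    val numBits: Int,
    val numHashes: Int,
    private val bits: BitSet
) {
  // Derive `numHashes` bit positions from two base hashes (double hashing).
  private def positions(value: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(value, 17)
    val h2 = MurmurHash3.stringHash(value, 31)
    (0 until numHashes).map(i => java.lang.Math.floorMod(h1 + i * h2, numBits))
  }

  def addIdentifier(value: String): Unit =
    positions(value).foreach(i => bits.set(i))

  // May return a false positive, never a false negative.
  def contains(value: String): Boolean =
    positions(value).forall(i => bits.get(i))

  // Union of two filters with identical parameters: bitwise OR.
  def merge(other: BloomFilter): BloomFilter = {
    require(other.numBits == numBits && other.numHashes == numHashes,
      "filters must share the same parameters")
    val merged = bits.clone().asInstanceOf[BitSet]
    merged.or(other.bits)
    new BloomFilter(numBits, numHashes, merged)
  }

  def sizeInBytes: Int = (numBits + 7) / 8
}

object BloomFilter {
  // Empty filter sized for `capacity` items at the target false-positive rate.
  def zero(capacity: Int = 10000, errorRate: Double = 0.01): BloomFilter = {
    val ln2 = math.log(2.0)
    val numBits = math.ceil(-capacity * math.log(errorRate) / (ln2 * ln2)).toInt
    val numHashes = math.max(1, math.round(numBits.toDouble / capacity * ln2).toInt)
    new BloomFilter(numBits, numHashes, new BitSet(numBits))
  }
}
```

With the defaults, `BloomFilter.zero().sizeInBytes / 1024.0` comes out at about 11.7, matching the slide. Note that the error rate only holds while the filter contains no more than its configured capacity; beyond that the false-positive rate climbs.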
![Page 18: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/18.jpg)
Counting Revenue: Transform
![Page 19: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/19.jpg)
Counting Revenue: Transform
![Page 20: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/20.jpg)
Counting Revenue: Storing
![Page 21: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/21.jpg)
Counting Revenue: Some Scala
https://github.com/lvicentesanchez/fast-gt-perfect
![Page 22: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/22.jpg)
Counting Revenue: Adding Up
![Page 23: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/23.jpg)
Counting Revenue: Result
• Constant storage for one day of data using a 10k BloomFilter: 288 * 11.7KB = 3370KB
• BloomFilter eliminates sales already counted
• Time bucketing ensures delayed data is counted correctly and keeps BloomFilters small
• Difference of <1% between approximated and real revenue
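A rough Scala sketch of how the pieces above could fit together: each sale is assigned to a time bucket by its own timestamp, and its amount is added to that bucket's total only if the bucket's filter has not seen the sale ID yet. To keep the example self-contained, an exact `SeenIds` set stands in for the BloomFilter; it is not a Bloom filter, it only exposes the same `contains`/`addIdentifier` pair. The `Sale` and `RevenueCounter` names and the 5-minute bucket width (implied by 288 buckets per day) are assumptions made for this example.

```scala
import scala.collection.mutable

// Stand-in for the BloomFilter: an exact set with the same two operations.
final class SeenIds {
  private val ids = mutable.Set.empty[String]
  def contains(id: String): Boolean = ids.contains(id)
  def addIdentifier(id: String): Unit = { ids += id; () }
}

// Hypothetical sale event: unique ID, event time, and amount.
final case class Sale(id: String, eventTimeMillis: Long, amount: BigDecimal)

final class RevenueCounter(bucketMillis: Long = 5L * 60L * 1000L) {
  private val seen = mutable.Map.empty[Long, SeenIds]
  private val totals = mutable.Map.empty[Long, BigDecimal]

  // Add a sale to its event-time bucket unless that bucket already saw the ID.
  def add(sale: Sale): Unit = {
    val bucket = sale.eventTimeMillis -
      java.lang.Math.floorMod(sale.eventTimeMillis, bucketMillis)
    val filter = seen.getOrElseUpdate(bucket, new SeenIds)
    if (!filter.contains(sale.id)) {
      filter.addIdentifier(sale.id)
      totals(bucket) = totals.getOrElse(bucket, BigDecimal(0)) + sale.amount
    }
  }

  // Sum of all buckets whose start time lies in [fromMillis, untilMillis).
  def revenue(fromMillis: Long, untilMillis: Long): BigDecimal =
    totals.collect { case (b, t) if b >= fromMillis && b < untilMillis => t }.sum
}
```

With a real Bloom filter in place of `SeenIds`, a false positive makes a new sale look already counted, so the approximation can only undercount revenue, never overcount it. Keeping one filter per bucket bounds the number of IDs each filter has to hold, which is why the filters can stay small.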
![Page 24: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/24.jpg)
Trending: Basics
How To Find the Top K
![Page 25: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/25.jpg)
Trending: StreamSummary
StreamSummary
addIdentifier(value: String)
merge(other: SS): SS
topK(k: Int): Seq[(String, Long)]
Configurable Size: Capacity = 400, Max Size = 21.9KB
Metwally, Agrawal & El Abbadi: Efficient Computation of Frequent and Top-k Elements in Data Streams (2005)
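The cited paper describes the Space-Saving algorithm: keep at most a fixed number of counters, and when a new item arrives while all counters are in use, evict the item with the smallest count and give the newcomer that count plus one. Below is a minimal Scala sketch of that idea with the operations from this slide. It is an illustration, not the talk's implementation: it uses a plain mutable map and a linear scan for the minimum instead of the paper's linked stream-summary structure, and the `merge` shown here (add up counts, keep the largest `capacity` entries) is a simplification chosen for this example.

```scala
import scala.collection.mutable

// Hypothetical minimal Space-Saving summary holding at most `capacity` counters.
final class StreamSummary(val capacity: Int = 400) {
  require(capacity > 0, "capacity must be positive")
  private val counts = mutable.Map.empty[String, Long]

  def addIdentifier(value: String): Unit =
    counts.get(value) match {
      case Some(c) => counts(value) = c + 1L
      case None if counts.size < capacity => counts(value) = 1L
      case None =>
        // Summary is full: evict the smallest counter and let the newcomer
        // inherit its count, so counts can overestimate but never underestimate.
        val (minKey, minCount) = counts.minBy(_._2)
        counts.remove(minKey)
        counts(value) = minCount + 1L
    }

  // The k largest counters, highest count first (ties broken by key).
  def topK(k: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (key, c) => (-c, key) }.take(k)

  // Simplified merge: sum the counters of both summaries, keep the top `capacity`.
  def merge(other: StreamSummary): StreamSummary = {
    val combined = mutable.Map.empty[String, Long]
    for ((key, c) <- counts.iterator ++ other.counts.iterator)
      combined(key) = combined.getOrElse(key, 0L) + c
    val out = new StreamSummary(capacity)
    combined.toSeq.sortBy { case (key, c) => (-c, key) }.take(capacity)
      .foreach { case (key, c) => out.counts(key) = c }
    out
  }
}
```

Unlike a HyperLogLog or a Bloom filter, adding the same event twice increments its counter twice, so duplicated input inflates the counts.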
![Page 26: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/26.jpg)
Trending: Transform
![Page 27: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/27.jpg)
Trending: Storing
![Page 28: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/28.jpg)
Trending: Adding Up
![Page 29: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/29.jpg)
Trending: Result
• Constant storage for one day of data using a Top400 StreamSummary: 288 * 21.9KB = 6307KB
• StreamSummary does not eliminate duplicates
• Time bucketing ensures delayed data is counted correctly
• Difference of <2% between StreamSummary trending items and the real trending items
![Page 30: Fast > Perfect: Practical real-time approximations using Spark Streaming](https://reader033.vdocuments.us/reader033/viewer/2022042615/55a982631a28ab6a458b46c9/html5/thumbnails/30.jpg)
Questions?
Kevin Schmidt (@kevinschmidtbiz)
Luis Vicente (@lvicentesanchez)
https://github.com/lvicentesanchez/fast-gt-perfect