Insight DE project
![Page 1: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/1.jpg)
YeezyScore
A comparison of stream processing software
By: Kat Chuang
@katychuang
![Page 2: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/2.jpg)
10 mins
![Page 3: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/3.jpg)
High level overview
Batch
Streaming
Microbatching
| | Storm Trident | Spark Streaming |
|---|---|---|
| Released | 2011 | 2010 |
| Delivery semantics | Exactly once | Exactly once |
| State management | Yes | Yes |
| Latency | Seconds | Seconds |
| Output | MapState | Resilient Distributed Dataset (RDD) |
| Throughput | 10k/nodes/sec? | 400k/nodes/sec? |
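To make the micro-batching idea concrete, here is a minimal sketch in plain Python. It is not the Storm Trident or Spark Streaming API. It assumes time-based batching (similar in spirit to a fixed batch interval), and the function name `micro_batches`, the `interval` parameter, and the sample events are hypothetical.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, payload) events into fixed-width time windows.

    Illustrative only: real frameworks form batches inside the engine;
    this just shows what "processing a stream in small batches" means.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Integer division assigns each event to the window containing its timestamp.
        batches[int(ts // interval)].append(payload)
    # Return the non-empty batches in window order.
    return [batches[key] for key in sorted(batches)]


# Hypothetical stream: (seconds since start, payload).
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
result = micro_batches(events, interval=1.0)
```

With a one-second interval, the five events fall into three batches; a pure streaming system would instead handle each event on arrival, and a pure batch system would handle all five together.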
![Page 4: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/4.jpg)
Test Cases
Metrics
1. Does every message pass through the pipeline?
2. How long does each message take to process?
Data
1. Timestamps
![Page 5: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/5.jpg)
Pipelines
[Diagram: in each pipeline a message enters carrying Timestamp1 and leaves as the pair (Timestamp1, Timestamp2).]
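The two test-case metrics can be sketched from the timestamp pairs alone. This is a minimal illustration, not the code used in the project: it assumes each message also carries an ID (the slides mention only timestamps), and the names `delivery_ratio`, `latencies`, and the sample data are hypothetical.

```python
def delivery_ratio(sent_ids, received_ids):
    """Fraction of sent message IDs that were seen at the pipeline output."""
    sent = set(sent_ids)
    if not sent:
        return 1.0  # nothing was sent, so nothing was lost
    return len(sent & set(received_ids)) / len(sent)


def latencies(records):
    """Per-message processing time from (timestamp1, timestamp2) pairs."""
    return [t2 - t1 for t1, t2 in records]


# Hypothetical run: four messages sent, three observed at the output.
sent = ["m1", "m2", "m3", "m4"]
received = {
    "m1": (10.0, 10.5),   # (Timestamp1, Timestamp2) in seconds
    "m2": (11.0, 12.0),
    "m4": (13.0, 13.25),
}

ratio = delivery_ratio(sent, received)   # iterating a dict yields its keys
lat = latencies(received.values())
```

Here `ratio` answers the first question (did every message pass through?) and `lat` answers the second (how long did each one take?); plotting `lat` against Timestamp1 would give a scatterplot of latency over time.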
![Page 6: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/6.jpg)
1. Does every message pass through the pipeline?
This is a scatterplot
![Page 7: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/7.jpg)
2. How long does each message take to process?
This is a scatterplot
![Page 8: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/8.jpg)
Storm Trident vs. Spark Streaming
Storm Trident
Stream processing framework that also does micro-batching.
Great for transforming or computing as data flows in.
Complex event processing (CEP), continuous computation.
Task-parallel computations, e.g. reading Twitter streams.
Spark Streaming
Batch processing framework that also does micro-batching.
Great for combining with historical data.
ML algorithms included. Requires HDFS-backed data source.
Data-parallel computations, e.g. offering recommendations.
![Page 9: Insight DE project](https://reader031.vdocuments.us/reader031/viewer/2022030303/587c0f431a28ab03768b63eb/html5/thumbnails/9.jpg)
Kat Chuang
Data Engineering Fellow, #DE-2015c
[email protected]: katychuang
Twitter: katychuang
IG: katychuang.nyc