LinkedIn-Teradata Summit, Feb 25, 2015
![Page 1: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/1.jpg)
Stream Processing with Samza
Navina Ramesh
DDS, Data Infrastructure
February 25, 2015
![Page 2: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/2.jpg)
Outline
• Introduction
• Use Cases at LinkedIn
• Architecture & Concepts
![Page 3: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/3.jpg)
Response latency
• Synchronous: 0 ms
• Stream processing: milliseconds to minutes
• Later. Possibly much later.
![Page 4: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/4.jpg)
Use cases @ LinkedIn
• Data standardization platform (Project “Waterloo”)
• Call graph assembly
• Metrics & Monitoring
![Page 5: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/5.jpg)
Call graph assembly
| Step | Map-reduce/Hadoop | Samza |
|---|---|---|
| Filter/redirect records | Mapper | Repartition job |
| Process the grouped records | Reduce | Aggregation job |
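The two-stage shape in this comparison can be sketched in plain Java with no Samza dependency. This is only an illustration: the class name `CallGraphSketch`, the `Call` fields, and the use of a request ID as the grouping key are assumptions, since the slides do not show the actual job code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: names and fields are illustrative, not LinkedIn's code.
class CallGraphSketch {
    // One service-call event; requestId ties together calls made for the same request.
    static final class Call {
        final String requestId;
        final String service;

        Call(String requestId, String service) {
            this.requestId = requestId;
            this.service = service;
        }
    }

    // Stage 1 (the "mapper" role): route each record to a partition chosen from its key,
    // so that all calls for one request land in the same partition.
    static List<List<Call>> repartition(List<Call> input, int numPartitions) {
        List<List<Call>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Call c : input) {
            int p = Math.floorMod(c.requestId.hashCode(), numPartitions);
            partitions.get(p).add(c);
        }
        return partitions;
    }

    // Stage 2 (the "reducer" role): within one partition, group calls by request
    // and collect the services each request touched, in arrival order.
    static Map<String, List<String>> aggregate(List<Call> partition) {
        Map<String, List<String>> graph = new TreeMap<>();
        for (Call c : partition) {
            graph.computeIfAbsent(c.requestId, k -> new ArrayList<>()).add(c.service);
        }
        return graph;
    }

    public static void main(String[] args) {
        List<Call> calls = Arrays.asList(
            new Call("r1", "frontend"), new Call("r2", "frontend"),
            new Call("r1", "profile"), new Call("r2", "search"));
        for (List<Call> partition : repartition(calls, 2)) {
            System.out.println(aggregate(partition));
        }
    }
}
```

Because the partition is derived from the key alone, each request ID appears in exactly one partition, which is what lets the aggregation stage work on partitions independently.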
![Page 6: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/6.jpg)
Samza Concepts & Architecture
• Streams
• Tasks
• Jobs
• Stateful Stream Processing
![Page 7: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/7.jpg)
Streams
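[Figure: a stream divided into Partition 0, Partition 1, and Partition 2. Each partition is an ordered sequence of numbered messages (1–6, 1–5, and 1–7 in the figure), and the next message is appended at the end.]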
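A partitioned, append-only stream of this kind can be modelled in a few lines of Java. The sketch below is a hypothetical in-memory model, not the Kafka or Samza API; the class name `PartitionedStreamSketch` and the hash-based choice of partition are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a partitioned stream; not Kafka's or Samza's API.
class PartitionedStreamSketch {
    private final List<List<String>> partitions = new ArrayList<>();

    PartitionedStreamSketch(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Pick a partition from the key so that equal keys always share a partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitions.size());
    }

    // Append at the end of the key's partition and return the new message's offset.
    // Offsets here start at 0; the figure numbers messages from 1.
    long append(String key, String message) {
        List<String> log = partitions.get(partitionFor(key));
        log.add(message);
        return log.size() - 1;
    }

    // Read by (partition, offset); messages are never modified once appended.
    String read(int partition, long offset) {
        return partitions.get(partition).get((int) offset);
    }

    // Offset that the next appended message will receive in this partition.
    long nextOffset(int partition) {
        return partitions.get(partition).size();
    }

    public static void main(String[] args) {
        PartitionedStreamSketch stream = new PartitionedStreamSketch(3);
        long first = stream.append("member-42", "viewed /home");
        long second = stream.append("member-42", "viewed /jobs");
        int p = stream.partitionFor("member-42");
        System.out.println("partition " + p + ", offsets " + first + " and " + second);
        System.out.println("next append goes to offset " + stream.nextOffset(p));
    }
}
```

Ordering is only defined within a partition: two messages with the same key keep their relative order, while messages in different partitions have no ordering relationship.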
![Page 8: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/8.jpg)
Tasks
[Figure: Partition 0 is consumed by Task 1.]
![Page 9: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/9.jpg)
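Tasks
Partition 0
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.avro.generic.GenericRecord;
import org.apache.samza.system.*;
import org.apache.samza.task.*;

class PageKeyViewsCounterTask implements StreamTask {
  // Assumed declarations: the slide uses these names without defining them.
  private final Map<String, AtomicInteger> pageKeyViews = new HashMap<>();
  private final SystemStream countStream = new SystemStream("kafka", "page-key-counts");
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    GenericRecord record = (GenericRecord) envelope.getMessage();
    String pageKey = record.get("page-key").toString();
    int newCount = pageKeyViews.computeIfAbsent(pageKey, k -> new AtomicInteger()).incrementAndGet();
    collector.send(new OutgoingMessageEnvelope(countStream, pageKey, newCount));
  }
}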
![Page 11: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/11.jpg)
Tasks
[Figure: PageKeyViewsCounterTask consuming messages 1–4 from "Page Views - Partition 0" and producing to an Output Count Stream; the diagram also labels Partition 0 and Partition 1.]
![Page 19: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/19.jpg)
Tasks
[Figure: the PageKeyViewsCounterTask diagram ("Page Views - Partition 0" with messages 1–4, Output Count Stream, Partition 0 and Partition 1), now with a Checkpoint Stream (labelled Partition 1) that holds the value 2.]
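The role of the checkpoint stream can be illustrated with a small simulation. This is a hedged sketch, not Samza's checkpointing code: `CheckpointSketch`, its methods, the `checkpointEvery` interval, and the simulated crash are all assumptions introduced here. It shows a task that periodically records the offset it has reached and, after a restart, resumes from the last recorded offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of offset checkpointing; names are illustrative, not Samza's API.
class CheckpointSketch {
    // Stand-in for the checkpoint stream: an append-only list of committed offsets.
    private final List<Integer> checkpointStream = new ArrayList<>();

    // Last committed offset, or -1 when nothing has been checkpointed yet.
    int lastCheckpoint() {
        return checkpointStream.isEmpty() ? -1 : checkpointStream.get(checkpointStream.size() - 1);
    }

    // Process messages starting just after the last checkpoint. The run stops before
    // offset crashAt to simulate a failure (pass input.size() for no crash).
    // Returns the offsets handled in this run.
    List<Integer> run(List<String> input, int checkpointEvery, int crashAt) {
        List<Integer> processed = new ArrayList<>();
        int sinceCheckpoint = 0;
        for (int offset = lastCheckpoint() + 1; offset < input.size(); offset++) {
            if (offset == crashAt) {
                return processed; // progress since the last checkpoint is not recorded
            }
            processed.add(offset); // a real task would handle input.get(offset) here
            sinceCheckpoint++;
            if (sinceCheckpoint == checkpointEvery) {
                checkpointStream.add(offset);
                sinceCheckpoint = 0;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        CheckpointSketch task = new CheckpointSketch();
        List<String> pageViews = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println("first run:  " + task.run(pageViews, 2, 3));
        System.out.println("second run: " + task.run(pageViews, 2, pageViews.size()));
    }
}
```

With these parameters the first run handles offsets 0, 1, and 2 but only offset 1 is checkpointed before the crash, so the second run starts again at offset 2. A message processed after the last checkpoint is therefore seen twice, which is the at-least-once behaviour that offset checkpointing gives in this model.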
![Page 28: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/28.jpg)
Jobs
[Figure: a job made up of Task 1, Task 2, and Task 3, with the AdViews and AdClicks streams above it and the AdClickThroughRate stream below.]
![Page 30: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/30.jpg)
Stream Processing is Hard
• Partitioning
• Re-processing
• Failure semantics
• State
• Joins to services or database
• Non-determinism
![Page 32: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/32.jpg)
Jobs
[Figure: Task 1, Task 2, and Task 3, with the AdViews and AdClicks streams above them and the AdClickThroughRate stream below.]
SELECT AdViews.id, COUNT(AdViews) views, COUNT(AdClicks) clicks, clicks/views ctr
FROM AdViews
LEFT JOIN AdClicks
ON AdViews.id = AdClicks.id
GROUP BY AdViews.id
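As a stream job, the same click-through-rate computation amounts to keeping running counters per ad and updating them as events arrive. The Java below is a minimal sketch under stated assumptions: the class `AdCtrSketch` and its method names are hypothetical, and it presumes both streams are partitioned by ad ID so that one task sees every view and click for a given ad.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the click-through-rate job as a stateful task.
class AdCtrSketch {
    // Per-ad counters: index 0 holds views, index 1 holds clicks.
    private final Map<String, long[]> counts = new HashMap<>();

    // Called for each message on the AdViews stream.
    void onAdView(String adId) {
        counts.computeIfAbsent(adId, k -> new long[2])[0]++;
    }

    // Called for each message on the AdClicks stream.
    void onAdClick(String adId) {
        counts.computeIfAbsent(adId, k -> new long[2])[1]++;
    }

    // clicks / views for one ad; 0.0 when the ad has no recorded views.
    double ctr(String adId) {
        long[] c = counts.get(adId);
        if (c == null || c[0] == 0) {
            return 0.0;
        }
        return (double) c[1] / c[0];
    }

    public static void main(String[] args) {
        AdCtrSketch task = new AdCtrSketch();
        for (int i = 0; i < 4; i++) {
            task.onAdView("ad-1");
        }
        task.onAdClick("ad-1");
        System.out.println("ctr(ad-1) = " + task.ctr("ad-1"));
    }
}
```

The counters are state that must survive a task restart, which is why a job like this needs more than offset checkpoints.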
![Page 33: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/33.jpg)
Stateful Tasks
[Figure: Task 1, Task 2, and Task 3 sit between Stream A and Stream B.]
![Page 34: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/34.jpg)
Stateful Tasks
[Figure: Task 1, Task 2, and Task 3 between Stream A and Stream B, with a Changelog Stream added.]
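One way to read the changelog stream in this diagram: each task keeps its state locally, mirrors every write to a changelog, and rebuilds the state by replaying that changelog after a failure. The sketch below models that idea in plain Java. It is not Samza's `KeyValueStore` API; `ChangelogStoreSketch`, `Change`, and `restore` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a task-local store backed by a changelog.
class ChangelogStoreSketch {
    // One changelog entry: the key that changed and its new value.
    static final class Change {
        final String key;
        final long value;

        Change(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> local = new HashMap<>();
    private final List<Change> changelog;

    // The changelog outlives the task, so it is passed in rather than created here.
    ChangelogStoreSketch(List<Change> changelog) {
        this.changelog = changelog;
    }

    // Every write goes to the local store and is also appended to the changelog.
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new Change(key, value));
    }

    // Reads are served from local state only; missing keys read as 0.
    long get(String key) {
        return local.getOrDefault(key, 0L);
    }

    // Rebuild local state after a failure by replaying the changelog in order;
    // later entries for a key overwrite earlier ones.
    static ChangelogStoreSketch restore(List<Change> changelog) {
        ChangelogStoreSketch store = new ChangelogStoreSketch(changelog);
        for (Change c : changelog) {
            store.local.put(c.key, c.value);
        }
        return store;
    }

    public static void main(String[] args) {
        List<Change> changelog = new ArrayList<>();
        ChangelogStoreSketch store = new ChangelogStoreSketch(changelog);
        store.put("ad-1", 1);
        store.put("ad-1", 2);
        ChangelogStoreSketch recovered = restore(changelog);
        System.out.println("recovered ad-1 = " + recovered.get("ad-1"));
    }
}
```

Reads never leave the task, so lookups stay local, while durability comes from the changelog rather than from a remote database.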
![Page 41: LinkedIn-Teradata Summit feb 25, 2015](https://reader033.vdocuments.us/reader033/viewer/2022042817/55a8fff71a28ab457c8b45c1/html5/thumbnails/41.jpg)
Resources
• What’s next?
– Support for SQL operators over streams
– Samza without YARN
• Get involved:
– Apache – http://samza.apache.org
– Dev Mailing List – [email protected]
– JIRA – https://issues.apache.org/jira/browse/SAMZA