benchmarking apache samza: 1.2 million messages per sec per node
TRANSCRIPT
Benchmarking Apache Samza: 1.2 million messages per sec per node
1
Tao Feng Performance Team @LinkedIn
Samza for stream processing
4
Input Stream
Task 1 Task 2 Task 3
Output Stream Changelog Stream
Local state store
Checkpoint
Container
How Samza scales within node
5
0 Task1 1 … n
Part 1
0 Task2 1 … n
Part 2
Container1
0 Task3 1 … n
Part 3
0 Task4 1 … n
Part 4
0 Task1 1 … nPart 1
Container1
0 Task2 1 … nPart 2
Container2
0 Task3 1 … nPart 3
Container3
0 Task4 1 … nPart 4
Container4
Test setup
7
0
broker
KaOa Clusters
1 … N
Container
• Use KaOa Producer to replay messages
• 10 million unique keys • Each message around
100 bytes
Container
Container
Test System
• Test System config • 24 cores • 1gbps nic • 1.65TB SSD
broker
broker
Performance metrics
• How many messages does Samza container process per sec? – Process-‐envelopes
• How long does Samza container process one message? – ProcessMs – Due to SAMZA-‐738, it is not useful in our test
• Important metrics – Message behind high watermark
8
Test scenarios
• Message Passing: reads message from source topic, writes to des\na\on topic
• Key Count: reads message , calculates the count of the message key and stores back to the store
a. with in-‐memory store b. with RocksDB store c. with RocksDB store & changelog
9
Case 2: Key coun\ng with in-‐memory store
11
• 1 million messages/sec per node • Window task runs every 90s to clean up the state. Otherwise messages will fill up the heap and trigger frequent full gc
• CPU u\liza\on of the box is around 80%
Case 3: Key coun\ng with RocksDB store
12
• 443k messages/sec per node • The test is performed without tuning on RocksDB • With SAMZA-‐449, it should give us more hint on how long messages are processed in RocksDB in the future
• CPU u\liza\on is around 84%
Case 4: Key coun\ng with RocksDB store & changelog
13
• 300k messages/sec per node • We specify linger.ms to 1 to avoid frequent send • CPU u\liza\on is around 89%
Summary
• Isolated performance test environment • Replay-‐ability of messages with KaOa producer • Founda\on of capacity model for Samza in the future
– PublicaMon: A Memory Capacity Model for High Performing Data-‐filtering Applica<ons in Samza Framework, IEEE Interna<onal Conference on Big Data 2015 – Data Quality Issues Workshop
15
Message Passing
Key CounMng w in-‐memory
Key CounMng w RocksDB
Key CounMng w RocksDB & Changelog
Messages per sec per node
1.2 millions
1 millions 443k 300k