Architecture of a Kafka Camus infrastructure
© 2013 Impetus Technologies - Confidential
Kafka/Camus Project Phase I
Mountain View, CA
March 2013
(photos courtesy of LinkedIn)
Agenda
• Objective
• What tool to use?
• Kafka & Camus overview
• Infrastructure
• Architecture
• Performance benchmarks
Objective
• Customer has events (Data, UI) that happen in real time and need to be analyzed
• Immediate need for a batch-oriented mechanism
• Events need to be ETL’ed and analyzed in Hadoop
• Future need for more real-time stream analysis
• Potential bursts of streaming data
What tool to use?
• JMS:
  • Just an API
  • Not cross-language
  • Painful
  • Doesn’t scale
• ActiveMQ
  • Didn’t work for LinkedIn: http://sites.computer.org/debull/A12june/pipeline.pdf
• Apache Flume
Kafka overview
• Distributed, scalable pub/sub system for big data
• Producer -> Broker -> Consumer of message topics
• Multiple clients can consume at different velocities (synchronous/asynchronous)
• Consumer groups parallelize consumption of messages
• Persists messages, so consumers can rewind
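The persistence and rewind points can be illustrated with a toy in-memory model. This is only a sketch: `ToyTopic`, its methods, and the group names are hypothetical and are not Kafka's client API. It shows one idea only, that each consumer group keeps its own read position in a persisted log, so groups can read at different speeds and any group can move its position back to replay messages.

```python
from collections import defaultdict


class ToyTopic:
    """Hypothetical stand-in for one topic partition: an append-only message log."""

    def __init__(self):
        self.log = []                    # persisted messages; list index == offset
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def produce(self, message):
        self.log.append(message)
        return len(self.log) - 1         # offset assigned to this message

    def consume(self, group, max_messages=10):
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

    def rewind(self, group, offset=0):
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset out of range")
        self.offsets[group] = offset


topic = ToyTopic()
for i in range(5):
    topic.produce(f"event-{i}")

fast = topic.consume("hive-etl", max_messages=5)   # batch-style reader takes everything
slow = topic.consume("realtime", max_messages=2)   # a second group reads at its own pace
topic.rewind("hive-etl", 3)                        # messages are still in the log
replayed = topic.consume("hive-etl")
```

Because consuming never deletes anything from the log, the rewind is just a change to the group's stored offset.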
Kafka overview (continued)
• [Figures: additional Kafka overview diagrams]
Camus overview
• Pipeline out of Kafka to HDFS
• Automatic discovery of topics and partitions
• Finds latest offsets from Kafka nodes
• Uses Avro by default; option to use your own Decoder
• Allocates topic pulls among a set # of Hadoop job tasks
• Moves data files to HDFS directories according to timestamp
• Remembers the last offset per topic
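Two of the steps above, spreading topic pulls over a fixed number of tasks and filing data by timestamp, can be sketched as follows. Everything here is an assumption made for illustration: the round-robin policy, the hourly `YYYY/MM/DD/HH` layout, the `/data/camus` base path, and the topic names are hypothetical, and Camus's real allocation and directory scheme may differ.

```python
from datetime import datetime, timezone


def allocate(partitions, num_tasks):
    """Spread (topic, partition) pairs round-robin over a fixed number of tasks."""
    tasks = [[] for _ in range(num_tasks)]
    for i, partition in enumerate(sorted(partitions)):
        tasks[i % num_tasks].append(partition)
    return tasks


def output_dir(base, topic, epoch_ms):
    """Build an hourly directory path (UTC) from a message timestamp in milliseconds."""
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"{base}/{topic}/{ts:%Y/%m/%d/%H}"


# Hypothetical topics and partitions, as if discovered from the brokers.
partitions = [("ui_events", 0), ("ui_events", 1), ("data_events", 0)]
tasks = allocate(partitions, num_tasks=2)

# 1362139200000 ms is 2013-03-01 12:00:00 UTC.
path = output_dir("/data/camus", "ui_events", 1362139200000)
```

Filing by the message's own timestamp rather than by arrival time means a late pull still lands each record in the hour it belongs to.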
Infrastructure
• Kafka 0.7.2
  • 3 nodes
  • Benchmark tool that sets message size, # of threads, # of messages, topic name, data encoding
• CDH 4.2
  • 1 NN, 1 SNN, 3 slaves for Hadoop
• Camus
  • JSON or Avro decoder
• Zookeeper
• Hive
Infrastructure
• 8 Amazon EC2 large instances
  • Dual core 2.0 GHz
  • 1 7200 rpm SATA drive
  • 8 GB memory
• 200-byte messages
• 1 producer, 1 consumer
Customer architecture
• Event sources: Gaming, Shopping, Invite friends
• Kafka topic: Data events (e.g. user profile registrations)
• Kafka topic: UI events (e.g. game interaction)
• Consume topics via Camus every hour
• Use Hive to analyze the data
Performance summary
• Producer:
  • Avg 20,000 messages / sec
  • 3.81 MB per sec
• Consumer:
  • 16,600 messages / sec
  • 3.17 MB per sec -> about 190 MB/min (roughly 11 GB/hr)
• Customer goal: “want to scale to 5000 events per second at peak.”
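The throughput figures follow from the 200-byte message size on the infrastructure slide. The short calculation below is a sketch that assumes 1 MB means 1024 × 1024 bytes, which is the convention that reproduces the 3.81 and 3.17 MB per sec values. The consumer rate comes out at roughly 190 MB per minute, or about 11 GB per hour, and the measured consumer rate is a little over three times the customer's stated peak goal.

```python
MESSAGE_BYTES = 200        # message size from the infrastructure slide
MIB = 1024 * 1024          # assumption: "MB" on the slide means 2**20 bytes


def mb_per_sec(messages_per_sec, message_bytes=MESSAGE_BYTES):
    """Convert a message rate into MB per second for fixed-size messages."""
    return messages_per_sec * message_bytes / MIB


producer = mb_per_sec(20_000)                    # about 3.81 MB per sec
consumer = mb_per_sec(16_600)                    # about 3.17 MB per sec
consumer_mb_per_min = consumer * 60              # about 190 MB per minute
consumer_gb_per_hour = consumer * 3600 / 1024    # about 11 GB per hour

# Headroom of the measured consumer rate over the 5000 events/sec peak goal.
headroom = 16_600 / 5_000
```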
Performance benchmark
Records | Data type | Storage size on HDFS (bytes) | Hive count (sec) | Hive max (sec) | Camus run time | Kafka
500000 | JSON text data | 103779151 | 38.3 | 59 | 46 seconds | 34.2
500000 | JSON Serde | 103779151 | 46.3 | 48.2 | 46 seconds | 34.2
500000 | Avro data | 60962022 | 25.2 | 29.3 | 54 seconds | 15.9
1 Million | JSON text data | 416556931 | 27.582 | 50.889 | 1 minute | 40.56
1 Million | JSON Serde | 416556931 | 39.428 | 32.305 | - | 40.56
1 Million | Avro data | 122041553 | 35.806 | 26.328 | 1 minute | 22.36
7 Million | JSON text data | 1456636071 | 57.895 | 111.598 | 3 minutes 50 seconds | 388
7 Million | JSON Serde | 1456636071 | 83.225 | 83.776 | 3 minutes 50 seconds | 388
7 Million | Avro data | 866962131 | 60.63 | 62.896 | 4 minutes 50 seconds | 181
10 Million | JSON text data | 1919381181 | 78.337 | 144.667 | 5 minutes 1 second | 558
10 Million | JSON Serde | 1919381181 | 103.4 | 110 | 5 minutes 1 second | 558
10 Million | Avro data | 1239446765 | 87.042 | 90.958 | 7 minutes 23 seconds | 230
15 Million | JSON text data | 3157886975 | 107.325 | 201.125 | 6 minutes 24 seconds | 851
15 Million | JSON Serde | 3157886975 | 141.345 | 153.365 | - | 851
15 Million | Avro data | 1865267728 | 96.9 | 98.9 | 8 minutes 26 seconds | 377
20 Million | JSON text data | - | - | - | - | 1159
20 Million | JSON Serde | - | - | - | - | 1159
20 Million | Avro data | 2476833359 | 133.606 | 153.464 | 11 minutes 2 seconds | 234
Data Size Performance benchmark
Storage on HDFS (bytes) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 103779151 | 416556931 | 1456636071 | 1919381181 | 3157886975 | -
JSON Serde | 103779151 | 416556931 | 1456636071 | 1919381181 | 3157886975 | -
Avro data | 60962022 | 122041553 | 866962131 | 1239446765 | 1865267728 | 2476833359
[Chart: HDFS Storage size benchmark; series: JSON text data, JSON Serde, Avro data; x-axis: record count; y-axis: bytes, 0 to 3500000000]
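To make the storage comparison easier to read, the Avro-to-JSON size ratio can be computed for each run that has both values. This is a small sketch over the numbers in the table above; the dictionary layout and variable names are illustrative only, and the 20 Million run is left out because no JSON size was recorded for it.

```python
# HDFS storage sizes in bytes, copied from the benchmark table.
hdfs_bytes = {
    "500000 records": {"json": 103779151, "avro": 60962022},
    "1 Million records": {"json": 416556931, "avro": 122041553},
    "7 Million records": {"json": 1456636071, "avro": 866962131},
    "10 Million records": {"json": 1919381181, "avro": 1239446765},
    "15 Million records": {"json": 3157886975, "avro": 1865267728},
}

# Fraction of the JSON footprint that the Avro encoding of the same run occupies.
ratios = {run: sizes["avro"] / sizes["json"] for run, sizes in hdfs_bytes.items()}

for run, ratio in ratios.items():
    print(f"{run}: Avro is {ratio:.0%} of the JSON size")
```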
Kafka Speed Performance benchmark
Kafka | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 34.2 | 40.56 | 388 | 558 | 851 | 1159
JSON Serde | 34.2 | 40.56 | 388 | 558 | 851 | 1159
Avro data | 15.9 | 22.36 | 181 | 230 | 377 | 534
[Chart: Kafka comparison; series: JSON text data, JSON Serde, Avro data; x-axis: record count]
Camus Speed Performance benchmark
Camus (sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 46 | 60 | 230 | 301 | 384 | -
JSON Serde | 46 | 60 | 230 | 301 | 384 | -
Avro data | 54 | 85 | 290 | 443 | 506 | 662
[Chart: Camus comparison; series: JSON text data, JSON Serde, Avro data; x-axis: record count; y-axis: 0 to 700]
Count Speed Performance
Count (sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 38.3 | 27.58 | 57.89 | 78.337 | 107.325 | -
JSON Serde | 46.3 | 39.42 | 83.2 | 103.4 | 141.345 | -
Avro data | 25.2 | 35.8 | 60.6 | 87.042 | 96.9 | 133.606
[Chart: Select Count(*) comparison; series: JSON text data, JSON Serde, Avro data; x-axis: record count; y-axis: 0 to 160]
Max Speed Performance
Max (sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 59 | 50.889 | 111.598 | 144.667 | 201.125 | -
JSON Serde | 48.2 | 32.305 | 83.776 | 110 | 153.365 | -
Avro data | 29.3 | 26.328 | 62.896 | 90.958 | 98.9 | 153.464
[Chart: Max(field) comparison; series: JSON text data, JSON Serde, Avro data; x-axis: record count; y-axis: 0 to 250]
Q&A
Thank You