Architecture of a Kafka Camus infrastructure

© 2013 Impetus Technologies - Confidential 1 Kafka/Camus Project Phase I Mountain View, CA March 2013 (photos courtesy of LinkedIn)

Upload: mattlieber

Post on 10-May-2015

DESCRIPTION

Presentation about a project done at a customer utilizing Kafka, Camus, and Hive.

TRANSCRIPT

Page 1: Architecture of a Kafka camus infrastructure

© 2013 Impetus Technologies - Confidential

Kafka/Camus Project Phase I

Mountain View, CA

March 2013

(photos courtesy of LinkedIn)

Page 2: Architecture of a Kafka camus infrastructure

Agenda

• Objective

• What tool to use?

• Kafka & Camus overview

• Infrastructure

• Architecture

• Performance benchmarks

Page 3: Architecture of a Kafka camus infrastructure

Objective

• Customer has events (data, UI) that happen in real time and need to be analyzed

• Immediate need for a batch-oriented mechanism

• Events need to be ETL’ed and analyzed in Hadoop

• Future need for more real-time stream analysis

• Potential bursts of streaming data

Page 4: Architecture of a Kafka camus infrastructure

What tool to use?

• JMS:
  • Just an API
  • Not cross-language
  • Painful
  • Doesn’t scale

• ActiveMQ
  • Didn’t work for LinkedIn:
  • http://sites.computer.org/debull/A12june/pipeline.pdf

• Apache Flume

Page 5: Architecture of a Kafka camus infrastructure

Kafka overview

• Distributed, scalable pub/sub system for big data

• Producer -> Broker -> Consumer of message topics

• Multiple clients can consume at different velocities (synchronous/asynchronous)

• Notion of a consumer group to parallelize consumption of messages

• Persists messages, so consumers can rewind and re-read them
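
The points about independent consumers and rewinding can be illustrated with a small in-memory model. This is only a toy sketch, not the Kafka client API: the class `ToyLog`, its methods, and the group names are hypothetical, and it models one offset per consumer group over a single log.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for one persisted topic partition (not the Kafka API)."""

    def __init__(self):
        self.messages = []               # append-only log of published messages
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def publish(self, message):
        """Append a message and return the offset it was stored at."""
        self.messages.append(message)
        return len(self.messages) - 1

    def poll(self, group, max_messages=10):
        """Return the next batch for a group and advance only that group's offset."""
        start = self.offsets[group]
        batch = self.messages[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

    def rewind(self, group, offset=0):
        """Move a group's offset back; messages are kept, so they can be re-read."""
        if not 0 <= offset <= len(self.messages):
            raise ValueError("offset out of range")
        self.offsets[group] = offset


log = ToyLog()
for i in range(5):
    log.publish(f"event-{i}")

fast = log.poll("batch-etl", max_messages=5)  # reads everything at once
slow = log.poll("realtime", max_messages=2)   # reads at its own pace
log.rewind("batch-etl", 3)                    # replay from offset 3
replayed = log.poll("batch-etl")
```

Because the log is never truncated on read, the slow group is unaffected by the fast one, and a rewind is only a change to a stored offset.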

Page 6: Architecture of a Kafka camus infrastructure

Kafka overview

• More overview pictures:

Page 7: Architecture of a Kafka camus infrastructure

Camus overview

• Pipeline out of Kafka to HDFS

• Automatic discovery of topics and partitions

• Finds latest offsets from Kafka nodes

• Uses Avro by default; option to use your own Decoder

• Allocates topic pulls among a set # of Hadoop job tasks

• Moves data files to HDFS directories according to timestamp

• Remembers last offset / topic
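
Two of the steps above, spreading topic pulls over a fixed number of tasks and filing data by timestamp, can be sketched as follows. This is an illustration under stated assumptions, not Camus code: the directory layout, the function names, and the round-robin allocation are all hypothetical choices made for the example.

```python
from datetime import datetime, timezone


def hourly_dir(base, topic, timestamp_ms):
    """Map an event timestamp to an hourly directory.

    Assumed layout (hypothetical): <base>/<topic>/hourly/YYYY/MM/DD/HH
    """
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{base}/{topic}/hourly/{ts:%Y/%m/%d/%H}"


def assign_partitions(partitions, num_tasks):
    """Spread (topic, partition) pairs over a fixed number of tasks, round-robin."""
    tasks = [[] for _ in range(num_tasks)]
    for i, partition in enumerate(sorted(partitions)):
        tasks[i % num_tasks].append(partition)
    return tasks


# 1362139200000 ms is 2013-03-01 12:00:00 UTC
path = hourly_dir("/data", "ui_events", 1362139200000)
plan = assign_partitions([("ui_events", 0), ("ui_events", 1), ("data_events", 0)], 2)
```

Round-robin is the simplest possible allocation; a real job might instead balance tasks by the amount of data pending in each partition.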

Page 8: Architecture of a Kafka camus infrastructure

Infrastructure

• Kafka 0.7.2
  • 3 nodes
  • Benchmark tool to issue message size, # of threads, # of messages, topic name, data encoding

• CDH 4.2
  • 1 NN, 1 SNN, 3 slaves for Hadoop

• Camus
  • JSON or Avro decoder

• Zookeeper

• Hive

Page 9: Architecture of a Kafka camus infrastructure

Infrastructure

• 8 Amazon EC2 large instances
  • Dual core 2.0 GHz
  • 1 7200 rpm SATA drive
  • 8 GB memory

• 200-byte messages

• 1 producer, 1 consumer

Page 10: Architecture of a Kafka camus infrastructure

Customer architecture

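[Diagram: event sources feed Kafka topics, which are consumed by Camus and analyzed in Hive]

• Event sources: Gaming, Shopping, Invite friends

• Kafka topic: Data events (e.g. user profile registrations)

• Kafka topic: UI events (e.g. game interaction)

• Consume topics via Camus every hour

• Use Hive to analyze the data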

Page 11: Architecture of a Kafka camus infrastructure

Performance summary

• Producer:
  • Avg 20,000 messages / sec
  • 3.81 MB per sec

• Consumer:
  • 16,600 messages / sec
  • 3.17 MB per sec -> 190 MB/min (about 11 GB/hr)

• Customer goal: “want to scale to 5000 events per second at peak.”
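
The throughput figures follow from the 200-byte message size given on the infrastructure slide. The short check below reproduces them; it assumes "MB" here means 1024 × 1024 bytes, which is what makes the numbers match, and the function name is hypothetical. At 3.17 MB per second, the consumer moves about 190 MB per minute, or roughly 11 GB per hour.

```python
MESSAGE_BYTES = 200   # message size from the infrastructure slide
MIB = 1024 * 1024     # assumption: "MB" on the slide means 2**20 bytes


def mib_per_sec(messages_per_sec, message_bytes=MESSAGE_BYTES):
    """Convert a message rate into MiB per second."""
    return messages_per_sec * message_bytes / MIB


producer = mib_per_sec(20_000)   # producer rate from the slide
consumer = mib_per_sec(16_600)   # consumer rate from the slide

consumer_mib_per_min = consumer * 60
consumer_gib_per_hour = consumer * 3600 / 1024

# Headroom of the measured consumer rate over the customer's stated peak goal
headroom = 16_600 / 5_000

print(f"producer: {producer:.2f} MiB/s, consumer: {consumer:.2f} MiB/s")
print(f"consumer: {consumer_mib_per_min:.0f} MiB/min, {consumer_gib_per_hour:.1f} GiB/hr")
print(f"headroom over 5000 events/s goal: {headroom:.2f}x")
```

On these numbers the single-consumer rate is a little over three times the customer's stated peak of 5000 events per second.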

Page 12: Architecture of a Kafka camus infrastructure

Performance benchmark

Data size input | Data type | Storage size on HDFS (in bytes) | Hive Count (in sec) | Hive max (in sec) | Camus run time | Kafka
500000 records | JSON text data | 103779151 | 38.3 | 59 | 46 seconds | 34.2
500000 records | JSON Serde | 103779151 | 46.3 | 48.2 | 46 seconds | 34.2
500000 records | Avro data | 60962022 | 25.2 | 29.3 | 54 seconds | 15.9
1 Million records | JSON text data | 416556931 | 27.582 | 50.889 | 1 minute | 40.56
1 Million records | JSON Serde | 416556931 | 39.428 | 32.305 | | 40.56
1 Million records | Avro data | 122041553 | 35.806 | 26.328 | 1 minute | 22.36
7 Million records | JSON text data | 1456636071 | 57.895 | 111.598 | 3 minutes 50 seconds | 388
7 Million records | JSON Serde | 1456636071 | 83.225 | 83.776 | 3 minutes 50 seconds | 388
7 Million records | Avro data | 866962131 | 60.63 | 62.896 | 4 minutes 50 seconds | 181
10 Million records | JSON text data | 1919381181 | 78.337 | 144.667 | 5 minutes 1 seconds | 558
10 Million records | JSON Serde | 1919381181 | 103.4 | 110 | 5 minutes 1 seconds | 558
10 Million records | Avro data | 1239446765 | 87.042 | 90.958 | 7 minutes 23 seconds | 230
15 Million records | JSON text data | 3157886975 | 107.325 | 201.125 | 6 minutes 24 seconds | 851
15 Million records | JSON Serde | 3157886975 | 141.345 | 153.365 | | 851
15 Million records | Avro data | 1865267728 | 96.9 | 98.9 | 8 minutes 26 seconds | 377
20 Million records | JSON text data | | | | | 1159
20 Million records | JSON Serde | | | | | 1159
20 Million records | Avro data | 2476833359 | 133.606 | 153.464 | 11 minutes 2 seconds | 234
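
The Camus run times on this slide are written out as text, while the later Camus slide gives plain numbers. A small helper, sketched here with a hypothetical name, converts the text form to seconds so the two can be compared; the sample strings are taken from the slide.

```python
import re


def run_time_seconds(text):
    """Convert text such as '3 minutes 50 seconds' into a number of seconds."""
    minutes = re.search(r"(\d+)\s*minute", text)
    seconds = re.search(r"(\d+)\s*second", text)
    total = 0
    if minutes:
        total += int(minutes.group(1)) * 60
    if seconds:
        total += int(seconds.group(1))
    return total


# Run times as written on the slide (JSON text data, then Avro data)
json_runs = ["46 seconds", "3 minutes 50 seconds", "5 minutes 1 seconds", "6 minutes 24 seconds"]
avro_runs = ["54 seconds", "7 minutes 23 seconds", "8 minutes 26 seconds", "11 minutes 2 seconds"]

json_seconds = [run_time_seconds(t) for t in json_runs]
avro_seconds = [run_time_seconds(t) for t in avro_runs]
```

The converted values (for example 230, 301, and 384 seconds for JSON) agree with the figures shown on the Camus speed slide.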

Page 13: Architecture of a Kafka camus infrastructure

Data Size Performance benchmark

Storage on HDFS (in bytes) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 103779151 | 416556931 | 1456636071 | 1919381181 | 3157886975 |
JSON Serde | 103779151 | 416556931 | 1456636071 | 1919381181 | 3157886975 |
Avro data | 60962022 | 122041553 | 866962131 | 1239446765 | 1865267728 | 2476833359

[Chart: HDFS Storage size benchmark. Series: JSON text data, JSON Serde, Avro data. X-axis: 500000 to 20 Million records. Y-axis: storage size in bytes, 0 to 3500000000.]
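
Dividing the storage sizes by the record counts gives a rough bytes-per-record figure for each encoding. The sketch below does only that arithmetic on the numbers from the slide; the variable and function names are hypothetical, and `None` marks the 20 Million JSON size, which the slide leaves blank.

```python
RECORDS = [500_000, 1_000_000, 7_000_000, 10_000_000, 15_000_000, 20_000_000]
JSON_BYTES = [103779151, 416556931, 1456636071, 1919381181, 3157886975, None]
AVRO_BYTES = [60962022, 122041553, 866962131, 1239446765, 1865267728, 2476833359]


def per_record(sizes):
    """Bytes per record for each data set; None where the slide has no value."""
    return [None if s is None else s / n for s, n in zip(sizes, RECORDS)]


json_per_record = per_record(JSON_BYTES)
avro_per_record = per_record(AVRO_BYTES)
# Avro size as a fraction of the JSON size for the same data set
avro_to_json = [None if j is None else a / j for a, j in zip(AVRO_BYTES, JSON_BYTES)]

for n, j, a, r in zip(RECORDS, json_per_record, avro_per_record, avro_to_json):
    j_txt = "n/a" if j is None else f"{j:.1f}"
    r_txt = "n/a" if r is None else f"{r:.2f}"
    print(f"{n:>10,} records: JSON {j_txt} B/rec, Avro {a:.1f} B/rec, Avro/JSON {r_txt}")
```

Avro stays close to 122 to 124 bytes per record throughout, and JSON is roughly 190 to 210 bytes per record, in line with the 200-byte message size. The 1 Million JSON entry is the exception at about 417 bytes per record; the slides do not explain this.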

Page 14: Architecture of a Kafka camus infrastructure

Kafka Speed Performance benchmark

Kafka | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 34.2 | 40.56 | 388 | 558 | 851 | 1159
JSON Serde | 34.2 | 40.56 | 388 | 558 | 851 | 1159
Avro data | 15.9 | 22.36 | 181 | 230 | 377 | 534

[Chart: Kafka comparison. Series: JSON text data, JSON Serde, Avro data. X-axis: 500000 to 20 Million records. Data labels match the table above.]
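
The slide does not state the unit of the Kafka figures. Assuming they are elapsed seconds to push each data set through Kafka (an assumption, not confirmed by the source), the implied message rates can be computed as below; the names are hypothetical.

```python
RECORDS = [500_000, 1_000_000, 7_000_000, 10_000_000, 15_000_000, 20_000_000]

# Kafka figures from the slide; ASSUMED to be elapsed seconds
KAFKA_TIMES = {
    "JSON": [34.2, 40.56, 388, 558, 851, 1159],
    "Avro": [15.9, 22.36, 181, 230, 377, 534],
}

# Implied records per second for each encoding and data set
rates = {
    encoding: [n / t for n, t in zip(RECORDS, times)]
    for encoding, times in KAFKA_TIMES.items()
}

for encoding, values in rates.items():
    formatted = ", ".join(f"{v:,.0f}" for v in values)
    print(f"{encoding}: {formatted} records/sec")
```

Under that assumption the larger JSON runs come out at roughly 17,000 to 18,000 records per second, the same order as the 20,000 messages per second quoted in the performance summary, which lends some support to reading the column as seconds.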

Page 15: Architecture of a Kafka camus infrastructure

Camus Speed Performance benchmark

Camus (in sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 46 | 60 | 230 | 301 | 384 |
JSON Serde | 46 | 60 | 230 | 301 | 384 |
Avro data | 54 | 85 | 290 | 443 | 506 | 662

[Chart: Camus comparison. Series: JSON text data, JSON Serde, Avro data. X-axis: 500000 to 20 Million records. Y-axis: 0 to 700.]

Page 16: Architecture of a Kafka camus infrastructure

Count Speed Performance

Count (in sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 38.3 | 27.58 | 57.89 | 78.337 | 107.325 |
JSON Serde | 46.3 | 39.42 | 83.2 | 103.4 | 141.345 |
Avro data | 25.2 | 35.8 | 60.6 | 87.042 | 96.9 | 133.606

[Chart: Select Count(*) comparison. Series: JSON text data, JSON Serde, Avro data. X-axis: 500000 to 20 Million records. Y-axis: 0 to 160.]

Page 17: Architecture of a Kafka camus infrastructure

Max Speed Performance

Max (in sec) | 500000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records
JSON text data | 59 | 50.889 | 111.598 | 144.667 | 201.125 |
JSON Serde | 48.2 | 32.305 | 83.776 | 110 | 153.365 |
Avro data | 29.3 | 26.328 | 62.896 | 90.958 | 98.9 | 153.464

[Chart: Max(field) comparison. Series: JSON text data, JSON Serde, Avro data. X-axis: 500000 to 20 Million records. Y-axis: 0 to 250.]

Page 18: Architecture of a Kafka camus infrastructure

Q&A

Thank You