Kafka Evaluation - High Throughput Message Queue

Kafka Evaluation for Data Pipeline and ETL Shafaq Abdullah @ GREE Date: 10/21/2013


Post on 27-Jan-2015

Page 1: Kafka Evaluation - High Throughput Message Queue

Kafka Evaluation for Data Pipeline and ETL

Shafaq Abdullah @ GREE Date: 10/21/2013

Page 2: Outline

- High-Level Architecture
- Performance
- Scalability
- Fault-Tolerance/Error Recovery
- Operations and Monitoring
- Summary

Page 3: Intro

Kafka: a distributed pub/sub messaging system

In production: LinkedIn, Foursquare, Twitter, GREE (almost)

Usage:
- Log aggregation (user activity stream)
- Real-time events
- Monitoring
- Queuing

Page 4: Point-to-Point Data Pipelines

[Diagram: point-to-point pipelines among Game Server 1, Logs, ..., User Tracking, Security, Search, Social Graph, Rules/Recommendation/Engine, Vertica, Ops, Hadoop, Data Warehouse]

Page 5: Message - Central Data Channel

[Diagram: Game Server 1, Logs, ..., User Tracking, Security, Search, Social Graph, Rules/Recommendation/Engine, Vertica, Ops, Hadoop, Data Warehouse, all connected through a central Message Queue]

Page 6: Kafka Jargon

Topic: a string that identifies a message stream.

Partition: a logical division of a topic within the broker, to which logs generated by the producer are written. E.g. kafkaTopic is split into kafkaTopic1, kafkaTopic2.

Replica set: the replicas of a given partition within the broker cluster, with a leader (takes writes) and followers (read from the leader), balanced using hash-key modulo.
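
The "hash-key modulo" balancing mentioned above can be sketched in a few lines. This is an illustrative sketch only: the function name `choose_partition` and the use of CRC32 are assumptions made for the example, not Kafka's own partitioner.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index using hash-key modulo."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is used only because it is stable across runs; it is an
    # illustrative stand-in, not the hash function Kafka itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # The same key always lands on the same partition.
    for user_id in ("user-1", "user-2", "user-3"):
        print(user_id, "->", choose_partition(user_id, 2))
```

Because the mapping depends only on the key and the partition count, all messages for one key keep their relative order within a single partition.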

Page 7: Log - Message Queue

Page 8: Log - Message Queue

Page 9: Hardware Spec of Broker Cluster

AWS m1.large instance
- Dual-core Intel Xeon (CPU model and clock speed garbled in the source)
- 7.5 GB RAM
- 2 x 420 GB hard disk

Page 10: Performance Results

Producer threads   Transactions   Processing time (s)   Throughput (transactions/s)
1                  488396         64.129                7616
2                  1195748        110.868               10785
4                  1874713        140.375               13355
10                 47269410       338.094               13981
17                 7987317        568.028               14061
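
The throughput column is the transaction count divided by the processing time. The sketch below recomputes it from the figures as they appear in this transcript; the function names are made up for the example. With the numbers as transcribed, the 10-thread row does not reproduce its reported throughput, which may point to a transcription error in that row, so the values are left exactly as printed.

```python
# Rows as printed: (producer threads, transactions, seconds, reported tx/s)
ROWS = [
    (1, 488396, 64.129, 7616),
    (2, 1195748, 110.868, 10785),
    (4, 1874713, 140.375, 13355),
    (10, 47269410, 338.094, 13981),
    (17, 7987317, 568.028, 14061),
]


def throughput(transactions: int, seconds: float) -> float:
    """Transactions per second."""
    return transactions / seconds


def inconsistent_rows(rows, tolerance: float = 1.0):
    """Thread counts whose reported throughput differs from the recomputed one."""
    return [
        threads
        for threads, transactions, seconds, reported in rows
        if abs(throughput(transactions, seconds) - reported) > tolerance
    ]


if __name__ == "__main__":
    for threads, transactions, seconds, reported in ROWS:
        print(threads, round(throughput(transactions, seconds)), reported)
    print("rows that do not match:", inconsistent_rows(ROWS))
```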

Page 11: Latency vs Durability

Ack status      Time to publish (ms)   Tradeoff
No ack          0.7                    Greater risk of data loss
Wait for ack    1.5                    Lower risk of data loss
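
In the Kafka 0.8 producer, this tradeoff is controlled by the `request.required.acks` property. The sketch below builds the two property sets as plain dictionaries. The helper name `producer_props` and the broker address are hypothetical, and mapping the "wait for ack" row to the value `1` is an assumption, since the slide does not say which setting was benchmarked.

```python
def producer_props(wait_for_ack: bool, brokers: str = "broker1:9092") -> dict:
    """Build Kafka 0.8-style producer properties for the two rows above."""
    return {
        # Hypothetical broker address, for illustration only.
        "metadata.broker.list": brokers,
        "serializer.class": "kafka.serializer.StringEncoder",
        # "0": do not wait for any acknowledgement (lowest latency).
        # "1": wait until the partition leader has acknowledged the write.
        # Assumption: the "wait for ack" row used "1"; "-1" (wait for all
        # in-sync replicas) is the stricter alternative.
        "request.required.acks": "1" if wait_for_ack else "0",
    }


if __name__ == "__main__":
    print(producer_props(wait_for_ack=False))
    print(producer_props(wait_for_ack=True))
```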

Page 12: Integration Plan

1. Create a Kafka-Replica Sink
2. Feed data to Copier via Kafka-Sink
3. Benchmark Kafka-Copier-Vertica Pipeline
4. Improve/Refactor for Performance

Page 13: CA-P Theorem

C - Consistency (producer sync mode)
A - Availability (replication)
P - Partition tolerance (cluster in the same network, no network delay)

Strongly consistent and highly available

Page 14: Failure Recovery

Conventional quorum replication: 2f+1 replicas to tolerate f failures (e.g. ZooKeeper)

Kafka replication: f+1 replicas to tolerate f failures
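
As a small worked example of the two formulas, the sketch below tabulates how many replicas each scheme needs for a given number of tolerated failures. The function names are chosen for the example only.

```python
def quorum_replicas(failures: int) -> int:
    """Replicas needed by majority-quorum replication to tolerate f failures."""
    return 2 * failures + 1


def kafka_replicas(failures: int) -> int:
    """Replicas needed by Kafka-style replication to tolerate f failures."""
    return failures + 1


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f"f={f}: quorum={quorum_replicas(f)}, kafka={kafka_replicas(f)}")
```

For example, tolerating two failures takes five replicas under a majority quorum but three under the f+1 scheme.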

Page 15: Kafka Evaluation - High Throughput Message Queue

Page 16: Error Handling: Follower Failure

Leader:
➢ Message is propagated to follower
➢ Commit offset is checkpointed to disk

Follower failure and recovery:
➢ Kicked out of ISR (in-sync replica set)
➢ After restart, truncates log to last commit
➢ Catches up with leader → rejoins ISR
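
The follower recovery steps above can be sketched as a toy in-memory model. This is a simplified illustration, not Kafka's implementation: the `Replica` class, the list-based log and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    log: list = field(default_factory=list)
    checkpoint: int = 0  # number of committed messages checkpointed to disk


def fail_follower(isr: set, follower: Replica) -> None:
    """A failed follower is kicked out of the in-sync replica set."""
    isr.discard(follower.name)


def recover_follower(isr: set, leader: Replica, follower: Replica) -> None:
    """Truncate to the last commit, catch up with the leader, rejoin the ISR."""
    # 1. Drop everything after the last checkpointed commit.
    del follower.log[follower.checkpoint:]
    # 2. Fetch the messages the follower is missing from the leader.
    follower.log.extend(leader.log[len(follower.log):])
    follower.checkpoint = leader.checkpoint
    # 3. Fully caught up, so the follower rejoins the ISR.
    isr.add(follower.name)


if __name__ == "__main__":
    leader = Replica("broker-1", ["m0", "m1", "m2", "m3"], checkpoint=4)
    # The follower holds one uncommitted entry past its checkpoint.
    follower = Replica("broker-2", ["m0", "m1", "uncommitted"], checkpoint=2)
    isr = {"broker-1", "broker-2"}

    fail_follower(isr, follower)
    recover_follower(isr, leader, follower)
    print(follower.log, sorted(isr))
```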

Page 17: Error Handling: Leader Failure

➢ Embedded controller detects leader failure via ZK
➢ Leader election from ISR
➢ Committed messages are not lost
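
A minimal sketch of electing a new leader from the ISR is shown below. It is a simplified model with hypothetical names: the real controller logic is more involved, and the fallback to a replica outside the ISR (an "unclean" election) is included only to show why committed messages are safe when the new leader comes from the ISR.

```python
def elect_leader(replicas: list, isr: set, alive: set):
    """Return (new_leader, clean) for a partition whose leader has failed.

    replicas: assigned replicas in preference order
    isr:      names of replicas that were in sync with the old leader
    alive:    names of replicas that are currently reachable
    """
    # Clean election: any live ISR member holds every committed message.
    for replica in replicas:
        if replica in isr and replica in alive:
            return replica, True
    # Unclean election: no ISR member is alive, so a lagging replica takes
    # over and committed messages may be lost.
    for replica in replicas:
        if replica in alive:
            return replica, False
    return None, False


if __name__ == "__main__":
    replicas = ["broker-1", "broker-2", "broker-3"]
    # broker-1 was the leader and has just failed.
    print(elect_leader(replicas, isr={"broker-1", "broker-2"},
                       alive={"broker-2", "broker-3"}))
```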

Page 18: Scalability

- Horizontally scalable: add partitions in the broker cluster as higher throughput is needed (~10 Mb/s per server)
- Balancing for producer and consumer

Page 19: Monitoring via JMX

•  Number of messages by which the consumer lags behind the producer
•  Max lag in messages between follower and leader replicas
•  Unclean leader election rate
•  Whether the controller is active on the broker
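
The first metric above, consumer lag, is the difference between the newest offset in a partition and the offset the consumer has reached. The sketch below computes it per partition from two plain dictionaries. The function name and the input shape are assumptions for the example; it does not read any JMX bean.

```python
def consumer_lag(log_end_offsets: dict, consumer_offsets: dict) -> dict:
    """Messages the consumer is behind, per partition.

    log_end_offsets:  partition -> offset of the next message to be written
    consumer_offsets: partition -> offset of the next message to be consumed
    """
    return {
        partition: end - consumer_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


if __name__ == "__main__":
    lag = consumer_lag({0: 1500, 1: 980}, {0: 1450, 1: 980})
    print(lag, "total:", sum(lag.values()))
```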

Page 20: Operation Costs

3-node cluster: ~25 MB/s, < 20 ms end-to-end latency, with a replication factor of 3 and 6 consumer groups

Monthly operations: $500 + $250 (3-node ZooKeeper) + 1 man-month

Page 21: Nothing is perfect

- No callback in producer’s send()
- No fully automatic balancing until a script is run manually
- Balancing layer of topics via ZK

Page 22: Summary

•  Infinite scaling at ~10 Mb/s per server with < 10 ms latency
•  Costs < $1000/month + 1 man-month
•  No impact on current pipeline
•  0.8 final release, due in days, to be used in production