Data Pipeline with Kafka

Data Pipeline with Kafka by Peerapat A. is licensed under a Creative Commons Attribution 4.0 International License. Data Pipeline with Kafka Peerapat Asoktummarungsri AGODA

Upload: peerapat-asoktummarungsri

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Data Pipeline with Kafka

Data Pipeline with Kafka

Peerapat Asoktummarungsri, AGODA

Page 2: Data Pipeline with Kafka

Senior Software Engineer, Agoda.com

Contributor, Thai Java User Group (THJUG.com)

Contributor, Agile66

Page 3: Data Pipeline with Kafka

AGENDA

Big Data & Data Pipeline

Kafka Introduction

Quick Start

Monitoring

Data Pipeline for Search API

Hadoop integration with Camus

Page 4: Data Pipeline with Kafka

[Diagram: Big Data -> Hadoop (HDFS + MapReduce) -> Information]

Page 5: Data Pipeline with Kafka

Pipeline

[Diagram: Website -> log -> hadoop]

Page 6: Data Pipeline with Kafka

Growth

[Diagram: Website and Mobile -> log -> hadoop]

Page 7: Data Pipeline with Kafka

Complex

[Diagram: Website and Mobile send log and message data to hadoop and realtime monitoring]

Page 8: Data Pipeline with Kafka

Features become the problem

[Diagram: Website, Mobile, hadoop, realtime monitoring, Data Warehouse, and API, with further components marked NEW being added and connected to the rest]

Page 9: Data Pipeline with Kafka

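Data Pipeline

[Diagram: Website, Mobile, API, hadoop, realtime monitoring, and Data Warehouse no longer connect to each other directly. Sources produce into a central Data Pipeline and destinations consume from it.]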

Page 10: Data Pipeline with Kafka

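Compare: Queue vs Topic

Queue: each message is delivered to one consumer (three consumers receive messages 1, 2, and 3 respectively).

Topic: each message is delivered to every consumer (all three consumers receive message 1).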
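The difference between the two delivery models can be sketched in a few lines of plain Python. This is only an illustration of the semantics on the slide, not Kafka client code; the function names and consumer names are made up for the example, and round-robin is just one possible way a queue could pick a consumer.

```python
from itertools import cycle


def deliver_queue(messages, consumers):
    """Queue semantics: each message goes to exactly one consumer.

    Round-robin is an assumption made for this sketch.
    """
    inbox = {c: [] for c in consumers}
    for msg, consumer in zip(messages, cycle(consumers)):
        inbox[consumer].append(msg)
    return inbox


def deliver_topic(messages, consumers):
    """Topic (publish/subscribe) semantics: every consumer gets every message."""
    return {c: list(messages) for c in consumers}


# Hypothetical consumer names, mirroring the three consumers on the slide.
consumers = ["consumer-a", "consumer-b", "consumer-c"]

queue_result = deliver_queue([1, 2, 3], consumers)
topic_result = deliver_topic([1], consumers)

print("queue:", queue_result)
print("topic:", topic_result)
```

With a queue, messages 1, 2, and 3 end up with different consumers; with a topic, message 1 reaches all three.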

Page 11: Data Pipeline with Kafka

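General Topic Implementation

[Diagram: a Topic publishes message 2 to Consumer 1, Consumer 2, and Consumer 3, but only two of them receive it]

The consumer that does not receive it will lose the message.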

Page 12: Data Pipeline with Kafka

Distributed by Design

Fast

Scalable - It can be elastically and transparently expanded without downtime.

Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.

Page 13: Data Pipeline with Kafka

gid = Group ID

[Diagram: a Topic holds messages 1 to 7. Consumer 1, Consumer 2, and Consumer 3 share one group ID (gid = hadoop), and the messages are divided among them.]
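How consumers in one group share the work can be sketched as follows. In Kafka, a topic is split into partitions and each partition is read by exactly one consumer in a group. The three partitions, their contents, and the round-robin assignment below are assumptions made for this sketch; they are not taken from the slide, and real assignment is handled by Kafka itself.

```python
def assign_partitions(partitions, members):
    """Give each partition to exactly one group member (simple round-robin)."""
    assignment = {m: [] for m in members}
    for i, partition in enumerate(partitions):
        assignment[members[i % len(members)]].append(partition)
    return assignment


# Hypothetical topic with three partitions holding messages 1..7.
partitions = {0: [3, 6], 1: [1, 4, 7], 2: [2, 5]}

# The three consumers that share gid = hadoop.
members = ["consumer-1", "consumer-2", "consumer-3"]

assignment = assign_partitions(sorted(partitions), members)

# Each consumer receives only the messages of the partitions it owns.
received = {
    member: [msg for p in owned for msg in partitions[p]]
    for member, owned in assignment.items()
}

print(assignment)
print(received)
```

Every message is consumed once per group, which is what lets a group scale out by adding consumers.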

Page 14: Data Pipeline with Kafka

gid = Group ID

[Diagram: one Topic, three consumer groups: hadoop (gid = hadoop), realtime monitoring (gid = rtmon), and data warehouse (gid = warehouse). Each group receives its own copy of messages 1, 2, 3.]

Page 15: Data Pipeline with Kafka

gid = Group ID

[Diagram: the same Topic and groups (gid = hadoop, gid = rtmon, gid = warehouse) are now receiving message 9. A New Consumer joins with gid = newconsumer and reads from the start of the topic: messages 1, 2, 3.]
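The behaviour on the last two slides follows from one idea: the topic is a retained log, and each group ID keeps its own read position (offset). Below is a minimal in-memory sketch of that idea. The `TopicLog` class is hypothetical and is not a Kafka API. It also assumes that a group with no stored offset starts from the oldest retained message; in Kafka this depends on the consumer's offset reset setting.

```python
class TopicLog:
    """Toy model of a topic: an append-only list plus one offset per group ID."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # group id -> index of the next message to read

    def produce(self, msg):
        self.messages.append(msg)

    def poll(self, gid, max_messages=3):
        # Assumption: an unknown group starts at the oldest retained message.
        start = self.offsets.get(gid, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[gid] = start + len(batch)
        return batch


log = TopicLog()
for msg in range(1, 9):
    log.produce(msg)

existing_groups = ["hadoop", "rtmon", "warehouse"]

# The existing groups have already consumed messages 1..8.
for gid in existing_groups:
    while log.poll(gid):
        pass

log.produce(9)

latest = {gid: log.poll(gid) for gid in existing_groups}
new_batch = log.poll("newconsumer")

print(latest)     # every existing group sees only message 9
print(new_batch)  # the new group starts from the beginning
```

Because the broker does not delete a message once it is read, adding a consumer later does not require replaying data from the producers.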

Page 16: Data Pipeline with Kafka

Page 17: Data Pipeline with Kafka

Vagrant

Install Vagrant

Install VirtualBox

Clone https://github.com/stealthly/scala-kafka

vagrant up

Page 18: Data Pipeline with Kafka

BREW

brew update

brew install zookeeper kafka

Page 19: Data Pipeline with Kafka

Some Kafka Config

# The id of the broker. This must be set to a unique integer for each broker.

broker.id=0

# The port the socket server listens on

port=9092

# Zookeeper connection string (see zookeeper docs for details).

zookeeper.connect=localhost:2181

# Timeout in ms for connecting to zookeeper

zookeeper.connection.timeout.ms=6000

# The minimum age of a log file to be eligible for deletion

log.retention.hours=168
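As a quick sanity check on these values, the settings can be read with a few lines of Python. This is a sketch only: `parse_properties` is a hypothetical helper written for this example (it ignores escapes and line continuations that real `.properties` files allow), and the config text is copied from the slide.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


CONFIG = """
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0
# The port the socket server listens on
port=9092
# Zookeeper connection string (see zookeeper docs for details).
zookeeper.connect=localhost:2181
# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
"""

props = parse_properties(CONFIG)

retention_days = int(props["log.retention.hours"]) / 24
zk_timeout_seconds = int(props["zookeeper.connection.timeout.ms"]) / 1000

print(props)
print("retention (days):", retention_days)
print("zookeeper timeout (s):", zk_timeout_seconds)
```

A retention of 168 hours is 7 days, so with this config messages stay readable for a week regardless of whether any consumer has read them.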

Page 20: Data Pipeline with Kafka

Kafka @ LinkedIn (2013)

10 billion message writes per day

55 billion messages delivered to real-time consumers

367 topics that cover both user activity and operational data, the largest of which adds an average of 92 GB per day of batch-compressed messages

Messages are kept for 7 days, which averages about 9.5 TB of compressed messages across all topics.

Page 21: Data Pipeline with Kafka

KafkaOffsetMonitor

Download KafkaOffsetMonitor from GitHub: https://github.com/quantifind/KafkaOffsetMonitor

It is a single JAR file: KafkaOffsetMonitor-assembly-0.2.1.jar

java -cp KafkaOffsetMonitor-assembly-0.2.1.jar \
     com.quantifind.kafka.offsetapp.OffsetGetterWeb \
     --zk localhost \
     --port 8080 \
     --refresh 10.seconds \
     --retain 2.days
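The main number an offset monitor reports is consumer lag: how far a group's committed offset trails the end of the log in each partition. The arithmetic can be sketched as below. The function name and all offsets are hypothetical values invented for illustration, not output from KafkaOffsetMonitor.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Lag per partition = log end offset - committed offset of the group.

    A partition with no committed offset is treated as unread (offset 0),
    which is an assumption of this sketch.
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical snapshot for one topic and one group.
log_end_offsets = {0: 1200, 1: 1150, 2: 1300}
committed_offsets = {0: 1200, 1: 1100, 2: 900}

lag = consumer_lag(log_end_offsets, committed_offsets)
total_lag = sum(lag.values())

print(lag)
print("total lag:", total_lag)
```

A lag that keeps growing between refreshes suggests the group is consuming more slowly than producers are writing.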

Page 22: Data Pipeline with Kafka

Page 23: Data Pipeline with Kafka

Page 24: Data Pipeline with Kafka

Page 25: Data Pipeline with Kafka

Page 26: Data Pipeline with Kafka

Page 27: Data Pipeline with Kafka

Page 28: Data Pipeline with Kafka

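[Diagram: Hotels and the Hotel Manager send CHANGE requests over HTTP to an API. The API produces "Change Price & Inventory" messages to Kafka. A Consumer reads them into Cassandra, which the Search API uses to calculate prices.]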

Page 29: Data Pipeline with Kafka

[Diagram: the same CHANGE flow (Hotels and Hotel Manager -> API -> Kafka), now with several consumers reading from Kafka: the Price & Inventory Consumer, A Consumer, and B Consumer]

Page 30: Data Pipeline with Kafka

Camus

Page 31: Data Pipeline with Kafka

Slides available here: http://www.slideshare.net/nuboat

Source code available here: https://github.com/nuboat/akkakafkaexam

Page 32: Data Pipeline with Kafka

REFERENCES

http://www.slideshare.net/charmalloc/developingwithapachekafka-29910685

http://www.infoq.com/articles/apache-kafka

http://kafka.apache.org/

https://github.com/stealthly/scala-kafka

https://github.com/quantifind/KafkaOffsetMonitor

Page 33: Data Pipeline with Kafka

Q & A