Data Pipeline with Kafka by Peerapat A. is licensed under a Creative Commons Attribution 4.0 International License.
Data Pipeline with Kafka
Peerapat Asoktummarungsri
AGODA
Senior Software Engineer Agoda.com
Contributor Thai Java User Group (THJUG.com)
Contributor Agile66
AGENDA
Big Data & Data Pipeline
Kafka Introduction
Quick Start
Monitoring
Data Pipeline for Search API
Hadoop integration with Camus
[Diagram labels: Hadoop+ (HDFS, MapReduce), Big Data, Information]
Pipeline
[Diagram: Website -> log -> Hadoop]
Growth
[Diagram: Website and Mobile -> log -> Hadoop]
Complex
[Diagram: Website and Mobile send log and message data to Hadoop and realtime monitoring]
Features become the problem
[Diagram: Website, Mobile, API, and NEW sources feed Hadoop, realtime monitoring, Data Warehouse, and NEW consumers]
Data Pipeline
[Diagram: Website, Mobile, and API produce to the Data Pipeline; Hadoop, realtime monitoring, and Data Warehouse consume from it]
Compare
[Diagram: Queue: messages 1, 2, 3 are split across three consumers, one each. Topic: message 1 is delivered to all three consumers.]
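The two delivery styles compared on this slide can be sketched in a few lines of Python. This is a toy model, not broker code: the function names and the round-robin hand-out in the queue case are assumptions made for illustration, and real brokers may distribute work differently.

```python
from itertools import cycle


def deliver_queue(messages, consumers):
    """Queue style: each message is handed to exactly one consumer.

    Round-robin order is an assumption made for this sketch.
    """
    received = {name: [] for name in consumers}
    for name, msg in zip(cycle(consumers), messages):
        received[name].append(msg)
    return received


def deliver_topic(messages, consumers):
    """Topic style: every subscribed consumer gets a copy of every message."""
    return {name: list(messages) for name in consumers}


consumers = ["c1", "c2", "c3"]
queue_result = deliver_queue([1, 2, 3], consumers)
topic_result = deliver_topic([1], consumers)
print(queue_result)
print(topic_result)
```

With a queue the three messages are shared out, so adding consumers adds throughput; with a topic every consumer sees the same message, so adding consumers adds independent readers.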
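General Topic Implementation
[Diagram: Topic -> Consumer 1, Consumer 2, Consumer 3; message 2 reaches only two of the consumers. The remaining consumer is annotated: "This consumer will lose a message."]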
Distributed by Design
Fast
Scalable - It can be elastically and transparently expanded without downtime.
Durable - Messages are persisted on disk and replicated within the cluster to prevent data loss.
[Diagram: Topic holding msg 1 to 7; Consumer 1, Consumer 2, and Consumer 3 all use gid = hadoop and each receives msg]
gid = Group ID
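When several consumers share one group ID, Kafka assigns each partition of a topic to exactly one consumer in that group, so the group as a whole processes each message once. The sketch below is a toy model of that idea. The function name, the partition count of 3, and the modulo placement rule are all assumptions for illustration and are not Kafka client API.

```python
def split_within_group(messages, members, num_partitions=3):
    """Toy model: spread messages over partitions, one owner per partition."""
    # Hypothetical placement rule: message n lands in partition (n - 1) % num_partitions.
    partitions = {p: [] for p in range(num_partitions)}
    for n in messages:
        partitions[(n - 1) % num_partitions].append(n)

    # Each partition is owned by exactly one member of the group.
    owner = {p: members[p % len(members)] for p in partitions}

    received = {m: [] for m in members}
    for p, msgs in partitions.items():
        received[owner[p]].extend(msgs)
    return received


group = split_within_group(range(1, 8), ["Consumer 1", "Consumer 2", "Consumer 3"])
for member, msgs in group.items():
    print(member, msgs)
```

Every message from 1 to 7 ends up with exactly one of the three consumers, which is the behavior the shared gid is meant to give.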
[Diagram: Topic holding msg 1, 2, 3; three consumer groups each receive 1, 2, 3: hadoop (gid = hadoop), realtime monitoring (gid = rtmon), data warehouse (gid = warehouse)]
gid = Group ID
[Diagram: Topic now at msg 9; hadoop (gid = hadoop), realtime monitoring (gid = rtmon), and data warehouse (gid = warehouse) are each at 9; a New Consumer (gid = newconsumer) reads 1, 2, 3]
gid = Group ID
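The behavior in these diagrams follows from two properties: messages stay in the log after being read, and each group ID keeps its own read position. The sketch below is a minimal in-memory model of that, not Kafka client code. The ToyTopic class and its methods are hypothetical, and it assumes a new group starts from the oldest retained message; in Kafka the starting point for a new group depends on the consumer's auto.offset.reset setting.

```python
class ToyTopic:
    """In-memory stand-in for one topic: an append-only log plus one offset per group id."""

    def __init__(self):
        self.log = []       # messages are kept after being read
        self.offsets = {}   # gid -> index of the next message to read

    def produce(self, msg):
        self.log.append(msg)

    def poll(self, gid, max_messages=None):
        # Assumption: an unknown gid starts at offset 0 (the oldest message).
        start = self.offsets.get(gid, 0)
        end = len(self.log)
        if max_messages is not None:
            end = min(end, start + max_messages)
        batch = self.log[start:end]
        self.offsets[gid] = end
        return batch


topic = ToyTopic()
for n in range(1, 10):
    topic.produce(n)

hadoop = topic.poll("hadoop")
rtmon = topic.poll("rtmon")
warehouse = topic.poll("warehouse")
late = topic.poll("newconsumer", max_messages=3)
print(hadoop, late, topic.offsets)
```

The three existing groups have each read up to message 9, while the group that joined late begins at 1, 2, 3 without affecting the others.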
Vagrant
Install Vagrant
Install VirtualBox
Clone https://github.com/stealthly/scala-kafka
vagrant up
BREW
brew update
brew install zookeeper kafka
Some Kafka Config
# The id of the broker. This must be set to a unique integer for each broker.
broker.id=0
# The port the socket server listens on
port=9092
# Zookeeper connection string (see zookeeper docs for details).
zookeeper.connect=localhost:2181
# Timeout in ms for connecting to zookeeper
zookeeper.connection.timeout.ms=6000
# The minimum age of a log file to be eligible for deletion
log.retention.hours=168
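To make the values above easier to reason about, here is a small sketch that reads the same key=value lines and converts the retention setting to days. The parse_properties helper is hypothetical and handles only the simple comment and key=value forms shown on this slide, not the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


SAMPLE = """\
# The id of the broker.
broker.id=0
port=9092
zookeeper.connect=localhost:2181
zookeeper.connection.timeout.ms=6000
log.retention.hours=168
"""

config = parse_properties(SAMPLE)
retention_days = int(config["log.retention.hours"]) / 24
print(config["zookeeper.connect"], retention_days)
```

A retention of 168 hours is 7 days, so a consumer that falls behind by less than a week can still catch up from the log.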
Kafka @ LinkedIn (2013)
10 billion message writes per day
55 billion messages delivered to real-time consumers
367 topics that cover both user activity topics and operational data
The largest topic adds an average of 92 GB per day of batch-compressed messages
Messages are kept for 7 days, averaging about 9.5 TB of compressed messages across all topics.
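One way to read the first two figures is as a fan-out ratio: how many times, on average, each written message is delivered to a consumer. The short calculation below assumes both numbers cover the same one-day period, which the slide does not state explicitly for the delivery count.

```python
writes_per_day = 10_000_000_000       # 10 billion message writes per day
deliveries_per_day = 55_000_000_000   # 55 billion deliveries (assumed to be per day)

# Average number of consumer deliveries per written message.
fan_out = deliveries_per_day / writes_per_day
print(fan_out)
```

Under that assumption each message is read about five and a half times, which fits the picture of several independent consumer groups reading the same topics.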
KafkaOffsetMonitor
Download KafkaOffsetMonitor from GitHub: https://github.com/quantifind/KafkaOffsetMonitor
It ships as a single JAR file, KafkaOffsetMonitor-assembly-0.2.1.jar
java -cp KafkaOffsetMonitor-assembly-0.2.1.jar \
     com.quantifind.kafka.offsetapp.OffsetGetterWeb \
     --zk localhost \
     --port 8080 \
     --refresh 10.seconds \
     --retain 2.days
CHANGE
[Diagram labels: Hotels, Hotel Manager, API, Kafka, Produce Change, Price & Inventory Consumer, Cassandra, Search API, Calculate Price, HTTP]
CHANGE
[Diagram labels: Hotels, Hotel Manager, API, Kafka, Price & Inventory Consumer, A Consumer, B Consumer]
Camus
Slides available here: http://www.slideshare.net/nuboat
Source code available here: https://github.com/nuboat/akkakafkaexam
REFERENCES
http://www.slideshare.net/charmalloc/developingwithapachekafka-29910685
http://www.infoq.com/articles/apache-kafka
http://kafka.apache.org/
https://github.com/stealthly/scala-kafka
https://github.com/quantifind/KafkaOffsetMonitor
Q & A